Abstract

As examples such as the Monty Hall puzzle show, applying conditioning to update
a probability distribution on a "naive space", which does not take into account
the protocol used, can often lead to counterintuitive results. Here we examine why.
A criterion known as CAR ("coarsening at random") in the statistical literature 
characterizes when "naive" conditioning in a naive space works. We show that 
the CAR condition holds rather infrequently, and we provide a procedural char- 
acterization of it, by giving a randomized algorithm that generates all and only 
distributions for which CAR holds. This substantially extends previous charac- 
terizations of CAR. We also consider more generalized notions of update such as 
Jeffrey conditioning and minimizing relative entropy (MRE). We give a general- 
ization of the CAR condition that characterizes when Jeffrey conditioning leads to 
appropriate answers, and show that there exist some very simple settings in which 
MRE essentially never gives the right results. This generalizes and interconnects 
previous results obtained in the literature on CAR and MRE. 

1 Introduction 

Suppose an agent represents her uncertainty about a domain using a probability distribu- 
tion. At some point, she receives some new information about the domain. How should 
she update her distribution in the light of this information? Conditioning is by far the 
most common method in case the information comes in the form of an event. However, 
there are numerous well-known examples showing that naive conditioning can lead to 
problems. We give just two of them here. 

Example 1.1: The Monty Hall puzzle [Mosteller 1965; vos Savant 1990]: Suppose that 
you're on a game show and given a choice of three doors. Behind one is a car; behind 
the others are goats. You pick door 1. Before opening door 1, Monty Hall, the host (who
knows what is behind each door) opens door 3, which has a goat. He then asks you if you
still want to take what's behind door 1, or to take what's behind door 2 instead. Should 
you switch? Assuming that, initially, the car was equally likely to be behind each of the 
doors, naive conditioning suggests that, given that it is not behind door 3, it is equally 
likely to be behind door 1 and door 2. Thus, there is no reason to switch. However, 
another argument suggests you should switch: if a goat is behind door 1 (which happens 
with probability 2/3), switching helps; if a car is behind door 1 (which happens with 
probability 1/3), switching hurts. Which argument is right?

Example 1.2: The three-prisoners puzzle [Bar-Hillel and Falk 1982; Gardner 1961; 
Mosteller 1965]: Of three prisoners a, b, and c, two are to be executed, but a does 
not know which. Thus, a thinks that the probability that $i$ will be executed is 2/3 for
$i \in \{a, b, c\}$. He says to the jailer, "Since either b or c is certainly going to be executed,
you will give me no information about my own chances if you give me the name of one 
man, either b or c, who is going to be executed." But then, no matter what the jailer 
says, naive conditioning leads a to believe that his chance of execution went down from 
2/3 to 1/2.

There are numerous other well-known examples where naive conditioning gives what 
seems to be an inappropriate answer, including the two-children puzzle [Gardner 1982; 
vos Savant 1996; vos Savant 1994] and the second-ace puzzle [Freund 1965; Shafer 1985; 
Halpern and Tuttle 1993]. 1 

Why does naive conditioning give the wrong answer in such examples? As argued in 
[Halpern and Tuttle 1993; Shafer 1985], the real problem is that we are not conditioning 
in the right space. If we work in a larger "sophisticated" space, where we take the protocol 
used by Monty (in Example 1.1) and the jailer (in Example 1.2) into account, conditioning 
does deliver the right answer. Roughly speaking, the sophisticated space consists of all 
the possible sequences of events that could happen (for example, what Monty would 
say in each circumstance, or what the jailer would say in each circumstance), with their 
probability. 2 However, working in the sophisticated space has problems too. For one 
thing, it is not always clear what the relevant probabilities in the sophisticated space are. 
For example, what is the probability that the jailer says b if b and c are to be executed? 
Indeed, in some cases, it is not even clear what the elements of the larger space are. 
Moreover, even when the elements and the relevant probabilities are known, the size of 
the sophisticated space may become an issue, as the following example shows. 

1 Both the Monty Hall puzzle and the two-children puzzle were discussed in Ask Marilyn, Marilyn
vos Savant's weekly column in "Parade Magazine". Of all Ask Marilyn columns ever published, they
reportedly [vos Savant 1994] generated respectively the most and the second-most response.

2 The notions of "naive space" and "sophisticated space" will be formalized in Section 2. This
introduction is meant only to give an intuitive feel for the issues.

Example 1.3: Suppose that a world describes which of 100 people have a certain disease.
A world can be characterized by a tuple of 100 0s and 1s, where the $i$th component is
1 iff individual $i$ has the disease. There are $2^{100}$ possible worlds. Further suppose that
the "agent" in question is a computer system. Initially, the agent has no information,
and considers all $2^{100}$ worlds equally likely. The agent then receives information that is
assumed to be true about which world is the actual world. This information comes in
the form of statements like "individual $i$ is sick or individual $j$ is healthy" or "at least 7
people have the disease". Each such statement can be identified with a set of possible
worlds. For example, the statement "at least 7 people have the disease" can be identified
with the set of tuples with at least 7 1s. For simplicity, assume that the agent is given
information saying "the actual world is in set $U$", for various sets $U$. Suppose at some
point the agent has been told that the actual world is in $U_1, \ldots, U_n$. Then, after doing
conditioning, the agent has a uniform probability on $U_1 \cap \ldots \cap U_n$.

But how does the agent keep track of the worlds it considers possible? It certainly 
will not explicitly list them; there are simply too many. One possibility is that it keeps 
track of what it has been told; the possible worlds are then the ones consistent with 
what it has been told. But this leads to two obvious problems: checking for consistency 
with what it has been told may be hard, and if it has been told n things for large n, 
remembering them all may be infeasible. In situations where these two problems arise, 
an agent may not be able to condition appropriately.
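
To make the bookkeeping in Example 1.3 concrete, the following minimal Python sketch scales the setting down to 4 individuals (an assumption made only so that all $2^4$ worlds can be listed explicitly). The function names and the two sample statements are illustrative and do not come from the paper.

```python
from fractions import Fraction
from itertools import product

N = 4  # scaled down from 100 so that all worlds can be enumerated

def all_worlds(n=N):
    """Every tuple of n 0s and 1s; component i is 1 iff individual i is sick."""
    return set(product((0, 1), repeat=n))

def statement(pred, n=N):
    """Identify a statement with the set of worlds in which it is true."""
    return {w for w in all_worlds(n) if pred(w)}

def condition_uniform(told, n=N):
    """Uniform posterior on the intersection of all sets the agent was told."""
    possible = all_worlds(n)
    for u in told:
        possible &= u
    return {w: Fraction(1, len(possible)) for w in possible}

# "individual 0 is sick or individual 1 is healthy"
u1 = statement(lambda w: w[0] == 1 or w[1] == 0)
# "at least 3 people have the disease"
u2 = statement(lambda w: sum(w) >= 3)
posterior = condition_uniform([u1, u2])
```

With 100 individuals the same explicit representation could need up to $2^{100}$ entries, which is exactly the difficulty the example points to.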

Example 1.3 provides some motivation for working in the smaller, more naive space. Ex- 
amples 1.1 and 1.2 show that this is not always appropriate. Thus, an obvious question 
is when it is appropriate. It turns out that this question is highly relevant in the statis- 
tical areas of selectively reported data and missing data. Originally studied within these 
contexts [Rubin 1976; Dawid and Dickey 1977], it was later found that it also plays a 
fundamental role in the statistical work on survival analysis [Kleinbaum 1999]. Building 
on previous approaches, Heitjan and Rubin [1991] presented a necessary and sufficient 
condition for when conditioning in the "naive space" is appropriate. Nowadays this so- 
called CAR (Coarsening at Random) condition is an established tool in survival analysis. 
(See [Gill, van der Laan, and Robins 1997; Nielsen 1998] for overviews.) We examine 
this criterion in our own, rather different context, and show that it applies rather rarely. 
Specifically, we show that there are realistic settings where the sample space is structured 
in such a way that it is impossible to satisfy CAR, and we provide a criterion to help 
determine whether or not this is the case. We also give a procedural characterization of 
the CAR condition, by giving a randomized algorithm that generates all and only distri- 
butions for which CAR holds, thereby solving an open problem posed in [Gill, van der 
Laan, and Robins 1997]. 

We then show that the situation is worse if the information does not come in the 
form of an event. For that case, several generalizations of conditioning have been pro- 
posed. Perhaps the best known are Jeffrey conditioning [Jeffrey 1968] (also known as 
Jeffrey's rule) and Minimum Relative Entropy (MRE) Updating [Kullback 1959; Csiszar
1975; Shore and Johnson 1980] (also known as cross-entropy). Jeffrey conditioning is
a generalization of ordinary conditioning; MRE updating is a generalization of Jeffrey
conditioning.

We show that Jeffrey conditioning, when applicable, can be justified under an appro- 
priate generalization of the CAR condition. Although it has been argued, using mostly 
axiomatic characterizations, that MRE updating (and hence also Jeffrey conditioning) is, 
when applicable, the only reasonable way to update probability (see, e.g., [Csiszar 1991; 
Shore and Johnson 1980]), it is well known that there are situations where applying MRE 
leads to paradoxical, highly counterintuitive results [Hunter 1989; Seidenfeld 1986; van 
Fraassen 1981]. 

Example 1.4: Consider the Judy Benjamin problem [van Fraassen 1981]: Judy is lost 
in a region that is divided into two halves, Blue and Red territory, each of which is 
further divided into Headquarters Company area and Second Company area. A priori, 
Judy considers it equally likely that she is in any of these four quadrants. She contacts 
her own headquarters by radio, and is told "I can't be sure where you are. If you are 
in Red territory, the odds are 3:1 that you are in HQ Company area ..." At this point 
the radio gives out. MRE updating on this information leads to a distribution where the 
posterior probability of being in Blue territory is greater than 1/2. Indeed, if HQ had
said "If you are in Red territory, the odds are $\alpha : 1$ that you are in HQ company area
. . .", then for all $\alpha \neq 1$, according to MRE updating, the posterior probability of being
in Blue territory is always greater than 1/2.
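
The MRE posterior in the Judy Benjamin problem can be computed in closed form. The sketch below is not taken from the paper: it assumes the uniform prior of Example 1.4, encodes the message as the linear constraint P(red_hq) = alpha * P(red_2nd), and uses the standard Lagrange-multiplier solution of relative-entropy minimization under one linear constraint. The quadrant labels and function names are hypothetical.

```python
import math

def mre_judy_benjamin(alpha):
    """MRE update of the uniform prior on the four quadrants subject to
    P(red_hq) = alpha * P(red_2nd).

    Minimizing relative entropy under one linear constraint gives
    p_i proportional to prior_i * exp(lam * f_i), with f = 1 on red_hq,
    -alpha on red_2nd and 0 on the Blue quadrants; solving E[f] = 0
    yields lam = ln(alpha) / (1 + alpha).
    """
    lam = math.log(alpha) / (1.0 + alpha)
    weights = {
        "blue_hq": 1.0,
        "blue_2nd": 1.0,
        "red_hq": math.exp(lam),
        "red_2nd": math.exp(-alpha * lam),
    }
    total = sum(weights.values())
    return {cell: w / total for cell, w in weights.items()}

def blue_probability(alpha):
    """Posterior probability of Blue territory after the MRE update."""
    post = mre_judy_benjamin(alpha)
    return post["blue_hq"] + post["blue_2nd"]
```

Under these assumptions, `blue_probability(3.0)` is roughly 0.53, while `blue_probability(1.0)` stays at 1/2, in line with the claim in Example 1.4.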

In [Grove and Halpern 1997], a "sophisticated space" is provided where conditioning 
gives what is arguably the more intuitive answer in the Judy Benjamin problem, namely 
that if HQ sends a message of the form "if you are in Red territory, then the odds are
$\alpha : 1$ that you are in HQ company area", then Judy's posterior probability of being
in each of the two quadrants in Blue remains at 1/4. Seidenfeld [1986], strengthening 
results of Friedman and Shimony [1971], showed that there is no sophisticated space in 
which conditioning will give the same answer as MRE in this case. (See also [Dawid 
2001] for similar results along these lines.) We strengthen these results by showing 
that, even in a class of much simpler situations (where Jeffrey conditioning cannot be 
applied), using MRE in the naive space corresponds to conditioning in the sophisticated 
space in essentially only trivial cases. These results taken together show that generally 
speaking, working with the naive space, while an attractive approach, is likely to give 
highly misleading answers. That is the main message of this paper. 

We remark that, although there are certain similarities, our results are quite different 
in spirit from the well-known results of Diaconis and Zabell [1986]. They considered when 
a posterior probability could be viewed as the result of conditioning a prior probability 
on some larger space. By way of contrast, we have a fixed larger space in mind (the 
"sophisticated space"), and are interested in when conditioning in the naive space and 
the sophisticated space agree. 

It is also worth stressing that the distinction between the naive and the sophisticated 
space is entirely unrelated to the philosophical view that one has of probability and how
one should do probabilistic inference. For example, the probabilities in the Monty Hall
puzzle can be viewed as the participant's subjective probabilities about the location of 
the car and about what Monty will say under what circumstances; alternatively, they can 
be viewed as "frequentist" probabilities, inferred from watching the Monty Hall show on 
television for many weeks and then setting the probabilities equal to observed frequencies. 
The problem we address occurs both from a frequentist and from a subjective stance. 

The rest of this paper is organized as follows. In Section 2 we formalize the notion of 
naive and sophisticated spaces. In Section 3, we consider the case where the information 
comes in the form of an event. We describe the CAR condition and show that it is violated 
in a general setting of which the Monty Hall and three-prisoners puzzle are special cases. 
In Section 4 we give several characterizations of CAR. We supply conditions under which 
it is guaranteed to hold and guaranteed not to hold, and we give a randomized algorithm 
that generates all and only distributions for which CAR holds. In Section 5 we consider 
the case where the information is not in the form of an event. We first consider situations 
where Jeffrey conditioning can be applied. We show that Jeffrey conditioning in the naive 
space gives the appropriate answer iff a generalized CAR condition holds. We then show 
that, typically, applying MRE in the naive space does not give the appropriate answer. 
We conclude with some discussion of the implication of these results in Section 6. 

2 Naive vs. Sophisticated Spaces 

Our formal model is a special case of the multi-agent systems framework [Halpern and 
Fagin 1989], which is essentially the same as that used in [Friedman and Halpern 1997] 
to model belief revision. We assume that there is some external world in a set W, and 
an agent who makes observations or gets information about that world. We can describe 
the situation by a pair $(w, \ell)$, where $w \in W$ is the actual world, and $\ell$ is the agent's local
state, which essentially characterizes her information. $W$ is what we called the "naive
space" in the introduction. For the purposes of this paper, we assume that $\ell$ has the form
$(o_1, \ldots, o_n)$, where $o_j$ is the observation that the agent makes at time $j$, $j = 1, \ldots, n$. This
representation implicitly assumes that the agent remembers everything she has observed 
(since her local state encodes all the previous observations). Thus, we ignore memory 
issues here. We also ignore computational issues, just so as to be able to focus on when 
conditioning is appropriate. 

A pair $(w, (o_1, \ldots, o_n))$ is called a run. A run may be viewed as a complete description
of what happens over time in one possible execution of the system. For simplicity, in 
this paper, we assume that the state of the world does not change over time. The 
"sophisticated space" is the set of all possible runs. 

In the Monty Hall puzzle, the naive space has three worlds, representing the three 
possible locations of the car. The sophisticated space describes what Monty would have 
said in all circumstances (i.e., Monty's protocol) as well as where the car is. The
three-prisoners puzzle is treated in detail in Example 2.1 below. While in these cases the
sophisticated space is still relatively simple, this is no longer the case for the Judy Benjamin
puzzle. Although the naive space has only four elements, constructing the sophisticated 
space involves considering all the things that HQ could have said, which is far from clear, 
and the conditions under which HQ says any particular thing. Grove and Halpern [1997] 
discuss the difficulties in constructing such a sophisticated space. 

In general, not only is it not clear what the sophisticated space is, but the need for a 
sophisticated space and the form it must take may become clear only after the fact. For 
example, in the Judy Benjamin problem, before contacting headquarters, Judy would 
almost certainly not have had a sophisticated space in mind (even assuming she was an 
expert in probability), and could not have known the form it would have to take until 
after hearing headquarters' response.

In any case, if the agent has a prior probability on the set $\mathcal{R}$ of possible runs in the
sophisticated space, after hearing or observing $(o_1, \ldots, o_k)$, she can condition, to get a
posterior on $\mathcal{R}$. Formally, the agent is conditioning her prior on the set of runs where
her local state at time $k$ is $(o_1, \ldots, o_k)$.

Clearly the agent's probability $\Pr$ on $\mathcal{R}$ induces a probability $\Pr_W$ on $W$ by
marginalization. We are interested in whether the agent can compute her posterior on $W$ after
observing $(o_1, \ldots, o_k)$ in a relatively simple way, without having to work in the
sophisticated space.
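
As a concrete illustration of conditioning on runs and then marginalizing to $W$, here is a minimal Python sketch for the Monty Hall puzzle of Example 1.1. It assumes (the puzzle statement does not say so) that Monty chooses uniformly at random whenever two doors are available to him; the function names are illustrative only.

```python
from fractions import Fraction

DOORS = (1, 2, 3)

def monty_prior():
    """Prior on runs (car_door, opened_door), assuming you picked door 1.

    Assumed protocol: Monty never opens door 1 or the car's door, and
    chooses uniformly when two doors are allowed.
    """
    prior = {}
    for car in DOORS:
        allowed = [d for d in DOORS if d != 1 and d != car]
        for opened in allowed:
            prior[(car, opened)] = Fraction(1, 3) * Fraction(1, len(allowed))
    return prior

def posterior_on_worlds(prior, observed):
    """Condition the runs on the observation, then marginalize to worlds."""
    consistent = {run: p for run, p in prior.items() if run[1] == observed}
    total = sum(consistent.values())
    return {car: p / total for (car, _), p in consistent.items()}

def naive_posterior(event):
    """Condition the uniform prior on the naive space W on an event."""
    return {car: Fraction(1, len(event)) for car in event}

sophisticated = posterior_on_worlds(monty_prior(), observed=3)
naive = naive_posterior({1, 2})
```

Under the assumed protocol, `sophisticated` assigns probability 1/3 to door 1 and 2/3 to door 2, whereas `naive` assigns 1/2 to each.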

Example 2.1: Consider the three-prisoners puzzle in more detail. Here the naive space is
$W = \{w_a, w_b, w_c\}$, where $w_x$ is the world where $x$ is not executed. We are only interested
in runs of length 1, so $n = 1$. The set $O$ of observations (what the agent can be told) is
$\{\{w_a, w_b\}, \{w_a, w_c\}\}$. Here "$\{w_a, w_b\}$" corresponds to the observation that either $a$ or
$b$ will not be executed (i.e., the jailer saying "$c$ will be executed"); similarly, $\{w_a, w_c\}$
corresponds to the jailer saying "$b$ will be executed". The sophisticated space consists of
the four runs

$$(w_a, (\{w_a, w_b\})),\quad (w_a, (\{w_a, w_c\})),\quad (w_b, (\{w_a, w_b\})),\quad (w_c, (\{w_a, w_c\})).$$

Note that there is no run with observation $(\{w_b, w_c\})$, since the jailer will not tell $a$ that
he will be executed.

According to the story, the prior $\Pr_W$ in the naive space has $\Pr_W(w) = 1/3$ for
$w \in W$. The full distribution $\Pr$ on the runs is not completely specified by the story. In
particular, we are not told the probability with which the jailer will say $b$ and $c$ if $a$ will
not be executed. We return to this point in Example 3.2.



3 The CAR Condition 

A particularly simple setting is where the agent observes or learns that the external world
is in some set $U \subseteq W$. For simplicity, we assume throughout this paper that the agent
makes only one observation, and makes it at the first step of the run. Thus, the set $O$ of
possible observations consists of nonempty subsets of $W$, and any run $r$ can be written
as $r = (w, (U))$ where $w$ is the actual world and $U$ is a nonempty subset of $W$. However,
$O$ does not necessarily consist of all the nonempty subsets of $W$. Some subsets may never
be observed. For example, in Example 2.1, $a$ is never told that he will be executed, so
$\{w_b, w_c\}$ is not observed. We assume that the agent's observations are accurate, in that
if the agent observes $U$ in a run $r$, then the actual world in $r$ is in $U$. That is, we assume
that all runs are of the form $r = (w, (U))$ where $w \in U$. In Example 2.1, accuracy is
enforced by the requirement that runs have the form $(w_x, (\{w_x, w_y\}))$.

The observation or information obtained does not have to be exactly of the form 
"the actual world is in $U$". It suffices that it is equivalent to such a statement. This is
the case in both the Monty Hall puzzle and the three-prisoners puzzle. For example, in
the three-prisoners puzzle, being told that $b$ will be executed is essentially equivalent to
observing $\{w_a, w_c\}$ (either $a$ or $c$ will not be executed).

In this setting, we can ask whether, after observing $U$, the agent can compute her
posterior on $W$ by conditioning on $U$. Roughly speaking, this amounts to asking whether
observing $U$ is the same as discovering that $U$ is true. This may not be the case in
general: observing or being told $U$ may carry more information than just the fact that
$U$ is true. For example, if for some reason $a$ knows that the jailer would never say $c$ if
he could help it (so that, in particular, if $b$ and $c$ will be executed, then he will definitely
say $b$), then hearing $c$ (i.e., observing $\{w_a, w_b\}$) tells $a$ much more than the fact that the
true world is one of $w_a$ or $w_b$. It says that the true world must be $w_b$ (for if the true
world were $w_a$, the jailer would have said $b$).

In the remainder of this paper we assume that $W$ is finite. For every scenario we
consider we define a set of possible observations $O$, consisting of nonempty subsets of $W$.
For given $W$ and $O$, the set of runs $\mathcal{R}$ is then defined to be the set

$$\mathcal{R} = \{(w, (U)) \mid U \in O,\ w \in U\}.$$

Given our assumptions that the state does not change over time and that the agent
makes only one observation, the set $\mathcal{R}$ of runs can be viewed as a subset of $W \times O$.
While just taking $\mathcal{R}$ to be a subset of $W \times O$ would slightly simplify the presentation
here, in general, we certainly want to allow sequences of observations. (Consider, for
example, an $n$-door version of the Monty Hall problem, where Monty opens a sequence
of doors.) This framework extends naturally to that setting.
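
The definition of the set of runs translates directly into code. The following sketch is illustrative only: it represents each observation as a `frozenset` of worlds and, since only one observation is made, represents the length-1 sequence $(U)$ by $U$ itself. The function name `build_runs` and the string labels for worlds are assumptions made for the example.

```python
def build_runs(worlds, observations):
    """Return the set of runs {(w, U) : U in observations, w in U}.

    Each observation is a frozenset of worlds; since only one observation
    is made, the length-1 sequence (U) is represented by U itself.
    """
    world_set = set(worlds)
    runs = set()
    for u in observations:
        if not u or not u <= world_set:
            raise ValueError("observations must be nonempty subsets of W")
        runs.update((w, u) for w in u)
    return runs

# The three-prisoners puzzle of Example 2.1.
W = {"w_a", "w_b", "w_c"}
O = [frozenset({"w_a", "w_b"}), frozenset({"w_a", "w_c"})]
R = build_runs(W, O)
```

For the three-prisoners puzzle this yields the four runs of Example 2.1; accuracy of observations holds by construction, because a world is paired only with observations that contain it.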

Whenever we speak of a distribution $\Pr$ on $\mathcal{R}$, we implicitly assume that the
probability of any set on which we condition is strictly greater than 0. Let $X_W$ and $X_O$ be
two random variables on $\mathcal{R}$, where $X_W$ is the actual world and $X_O$ is the observed event.
Thus, for $r = (w, (U))$, $X_W(r) = w$ and $X_O(r) = U$. Given a distribution $\Pr$ on runs $\mathcal{R}$,
we denote by $\Pr_W$ the marginal distribution of $X_W$, and by $\Pr_O$ the marginal distribution
of $X_O$. For example, for $V, U \subseteq W$, $\Pr_W(V)$ is short for $\Pr(X_W \in V)$ and $\Pr_W(V \mid U)$ is
short for $\Pr(X_W \in V \mid X_W \in U)$.

Let $\Pr$ be a prior on $\mathcal{R}$ and let $\Pr' = \Pr(\cdot \mid X_O = U)$ be the posterior after observing
$U$. The main question we ask in this paper is under what conditions we have

$$\Pr'_W(V) = \Pr_W(V \mid U) \tag{1}$$

for all $V \subseteq W$. That is, we want to know under what conditions the posterior on $W$
induced by $\Pr'$ can be computed from the prior on $W$ by conditioning on the observation.
(Example 3.2 below gives a concrete case.) We stress that $\Pr$ and $\Pr'$ are distributions
on $\mathcal{R}$, while $\Pr_W$ and $\Pr'_W$ are distributions on $W$ (obtained by marginalization from $\Pr$
and $\Pr'$, respectively). Note that (1) is equivalently stated as

$$\Pr(X_W = w \mid X_O = U) = \Pr(X_W = w \mid X_W \in U) \quad \text{for all } w \in U. \tag{2}$$

(1) (equivalently, (2)) is called the "CAR condition". It can be characterized as follows:

Theorem 3.1: [Gill, van der Laan, and Robins 1997] Fix a probability $\Pr$ on $\mathcal{R}$ and a
set $U \subseteq W$. The following are equivalent:

(a) If $\Pr(X_O = U) > 0$, then $\Pr(X_W = w \mid X_O = U) = \Pr(X_W = w \mid X_W \in U)$ for all
$w \in U$.

(b) The event $X_W = w$ is independent of the event $X_O = U$ given $X_W \in U$, for all
$w \in U$.

(c) $\Pr(X_O = U \mid X_W = w) = \Pr(X_O = U \mid X_W \in U)$ for all $w \in U$ such that
$\Pr(X_W = w) > 0$.

(d) $\Pr(X_O = U \mid X_W = w) = \Pr(X_O = U \mid X_W = w')$ for all $w, w' \in U$ such that
$\Pr(X_W = w) > 0$ and $\Pr(X_W = w') > 0$.

For completeness (and because it is useful for our later Theorem 5.1), we provide a 
proof of Theorem 3.1 in the appendix. 

The first condition in Theorem 3.1 is just (2). The third and fourth conditions justify 
the name "coarsening at random". Intuitively, first some world $w \in W$ is realized, and
then some "coarsening mechanism" decides which event $U \subseteq W$ such that $w \in U$ is
revealed to the agent. The event $U$ is called a "coarsening" of $w$. The third and fourth
conditions effectively say that the probability that $w$ is coarsened to $U$ is the same for
all $w \in U$. This means that the "coarsening mechanism" is such that the probability of
observing $U$ is not affected by the specific value of $w \in U$ that was realized.
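
Condition (d) of Theorem 3.1 lends itself to a direct mechanical check. The sketch below is one possible way to code it, not the paper's procedure: it takes a distribution on runs as a dictionary (a representation chosen here for illustration), applies condition (d) to every observation that occurs in it, and compares probabilities with exact equality, so exact rationals such as `Fraction` are assumed as inputs.

```python
from fractions import Fraction

def satisfies_car(pr):
    """Check condition (d) of Theorem 3.1 for every observation in `pr`.

    `pr` maps runs (w, U), with U a frozenset containing w, to probabilities.
    For each U, Pr(X_O = U | X_W = w) must be the same for all w in U
    that have positive probability.
    """
    world_prob = {}
    for (w, _), p in pr.items():
        world_prob[w] = world_prob.get(w, 0) + p
    for u in {u for (_, u) in pr}:
        rates = {
            pr.get((w, u), 0) / world_prob[w]
            for w in u
            if world_prob.get(w, 0) > 0
        }
        if len(rates) > 1:
            return False
    return True

# Disjoint observations: every world is always coarsened to the same set.
disjoint = {
    (1, frozenset({1, 2})): Fraction(1, 4),
    (2, frozenset({1, 2})): Fraction(1, 4),
    (3, frozenset({3})): Fraction(1, 2),
}
# Three prisoners, assuming the jailer flips a fair coin in world "a".
prisoners = {
    ("a", frozenset({"a", "b"})): Fraction(1, 6),
    ("a", frozenset({"a", "c"})): Fraction(1, 6),
    ("b", frozenset({"a", "b"})): Fraction(1, 3),
    ("c", frozenset({"a", "c"})): Fraction(1, 3),
}
```

Here `satisfies_car(disjoint)` holds, while `satisfies_car(prisoners)` fails because world "a" is coarsened to {a, b} with probability 1/2 but world "b" with probability 1.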

In the remainder of this paper, when we say "$\Pr$ satisfies CAR", we mean that $\Pr$
satisfies condition (a) of Theorem 3.1 (or, equivalently, any of the other three conditions)
for all $U \in O$. Thus, "$\Pr$ satisfies CAR" means that conditioning in the naive space
$W$ coincides with conditioning in the sophisticated space $\mathcal{R}$ with probability 1. The
CAR condition explains why conditioning in the naive space is not appropriate in the 
Monty Hall puzzle or the three-prisoners puzzle. We consider the three-prisoners puzzle 
in detail; a similar analysis applies to Monty Hall.

Example 3.2: In the three-prisoners puzzle, what is $a$'s prior distribution $\Pr$ on $\mathcal{R}$? In
Example 2.1 we assumed that the marginal distribution $\Pr_W$ on $W$ is uniform. Apart
from this, $\Pr$ is unspecified. Now suppose that $a$ observes $\{w_a, w_c\}$ ("the jailer says
$b$"). Naive conditioning would lead $a$ to adopt the distribution $\Pr_W(\cdot \mid \{w_a, w_c\})$. This
distribution satisfies $\Pr_W(w_a \mid \{w_a, w_c\}) = 1/2$. Sophisticated conditioning leads $a$ to
adopt the distribution $\Pr' = \Pr(\cdot \mid X_O = \{w_a, w_c\})$. By part (d) of Theorem 3.1, naive
conditioning is appropriate (i.e., $\Pr'_W = \Pr_W(\cdot \mid \{w_a, w_c\})$) only if the jailer is equally
likely to say $b$ in both worlds $w_a$ and $w_c$. Since the jailer must say that $b$ will be executed
in world $w_c$, it follows that $\Pr(X_O = \{w_a, w_c\} \mid X_W = w_c) = 1$. Thus, conditioning is
appropriate only if the jailer's protocol is such that he definitely says $b$ in $w_a$, i.e., even
if both $b$ and $c$ are executed. But if this is the case, when the jailer says $c$, conditioning
$\Pr_W$ on $\{w_a, w_b\}$ is not appropriate, since then $a$ knows that he will be executed. The
world cannot be $w_a$, for then the jailer would have said $b$. Therefore, no matter what the
jailer's protocol is, conditioning in the naive space cannot coincide with conditioning in
the sophisticated space for both of his responses.
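
The argument of Example 3.2 can be checked numerically. In the following sketch, the parameter `q` is a hypothetical name for the one quantity the story leaves open: the probability that the jailer says "b" in world $w_a$, where both $b$ and $c$ are executed. Everything else follows from the uniform prior and the accuracy requirement.

```python
from fractions import Fraction

def prisoner_posteriors(q):
    """Posterior probability of w_a after each possible statement.

    q is the (hypothetical) probability that the jailer says "b" in world
    w_a, where both b and c are executed. The prior on worlds is uniform.
    """
    third = Fraction(1, 3)
    # Joint probabilities of (world, statement); in w_b the jailer must say
    # "c", and in w_c he must say "b".
    says_b = {"w_a": third * q, "w_c": third}
    says_c = {"w_a": third * (1 - q), "w_b": third}
    post_b = says_b["w_a"] / sum(says_b.values())
    post_c = says_c["w_a"] / sum(says_c.values())
    return post_b, post_c

NAIVE = Fraction(1, 2)  # Pr_W(w_a | U) for either two-world observation U
```

The two posteriors are $q/(1+q)$ and $(1-q)/(2-q)$. The first equals the naive value 1/2 only for $q = 1$, the second only for $q = 0$, so no single protocol makes naive conditioning correct for both responses. With $q = 1/2$, both posteriors are 1/3: hearing either name leaves $a$'s chance of being spared unchanged.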

The following example shows that in general, in settings of the type arising in the Monty 
Hall and the three-prisoners puzzle, the CAR condition can only be satisfied in very 
special cases: 

Example 3.3: Suppose that $O = \{U_1, U_2\}$, and both $U_1$ and $U_2$ are observed with
positive probability. (This is the case for both Monty Hall and the three-prisoners
puzzle.) Then the CAR condition (Theorem 3.1(c)) cannot hold for both $U_1$ and $U_2$ unless
$\Pr(X_W \in U_1 \cap U_2)$ is either 0 or 1. For suppose that $\Pr(X_O = U_1) > 0$, $\Pr(X_O = U_2) > 0$,
and $0 < \Pr(X_W \in U_1 \cap U_2) < 1$. Without loss of generality, there is some $w_1 \in U_1 - U_2$
and $w_2 \in U_1 \cap U_2$ such that $\Pr(X_W = w_1) > 0$ and $\Pr(X_W = w_2) > 0$. Since observations
are accurate, we must have $\Pr(X_O = U_1 \mid X_W = w_1) = 1$. If CAR holds for $U_1$, then we
must have $\Pr(X_O = U_1 \mid X_W = w_2) = 1$. But then $\Pr(X_O = U_2 \mid X_W = w_2) = 0$. But
since $\Pr(X_O = U_2) > 0$, it follows that there is some $w_3 \in U_2$ such that $\Pr(X_W = w_3) > 0$
and $\Pr(X_O = U_2 \mid X_W = w_3) > 0$. This contradicts the CAR condition.

So when does CAR hold? The previous example exhibited a combination of O and W 
for which CAR can only be satisfied in "degenerate" cases. In the next section, we shall 
study this question for arbitrary combinations of O and W. 

4 Characterizing CAR 

In this section, we provide some characterizations of when the CAR condition holds, for 
finite O and W. Our results extend earlier results of Gill, van der Laan, and Robins 
[1997]. We first exhibit a simple situation in which CAR is guaranteed to hold, and we 
show that this is the only situation in which it is guaranteed to hold. We then show 
that, for arbitrary O and W, we can construct a 0-1-valued matrix from which a strong
necessary condition for CAR to hold can be obtained. It turns out that, in some cases
of interest, CAR is (roughly speaking) guaranteed not to hold except in "degenerate" 
situations. Finally, we introduce a new "procedural" characterization of CAR: we provide 
a mechanism such that a distribution Pr can be thought of as arising from the mechanism 
if and only if Pr satisfies CAR. 

4.1 When CAR is guaranteed to hold 

We first consider the only situation where CAR is guaranteed to hold: if the sets in O 
are pairwise disjoint. 

Proposition 4.1: The CAR condition holds for all distributions $\Pr$ on $\mathcal{R}$ if and only if
$O$ consists of pairwise disjoint subsets of $W$.

What happens if the sets in $O$ are not pairwise disjoint? Are there still cases
(combinations of $O$, $W$, and distributions on $\mathcal{R}$) when CAR holds? There are, but they are
quite special.



4.2 When CAR may hold

We now present a lemma that provides a new characterization of CAR in terms of a
simple 0/1-matrix. The lemma allows us to determine, for many combinations of $O$ and
$W$, whether a distribution on $\mathcal{R}$ exists that satisfies CAR and gives certain worlds positive
probability.

Fix a set $\mathcal{R}$ of runs, whose worlds are in some finite set $W$ and whose observations
come from some finite set $O = \{U_1, \ldots, U_n\}$. We say that $A \subseteq W$ is an $\mathcal{R}$-atom relative
to $W$ and $O$ if $A$ has the form $V_1 \cap \ldots \cap V_n$, where each $V_i$ is either $U_i$ or $\overline{U_i}$, and
$\{r \in \mathcal{R} : X_W(r) \in A\} \neq \emptyset$. Let $\mathcal{A} = \{A_1, \ldots, A_m\}$ be the set of $\mathcal{R}$-atoms relative to
$W$ and $O$. We can think of $\mathcal{A}$ as a partition of the worlds according to what can be
observed. Two worlds $w_1$ and $w_2$ are in the same set $A_i \in \mathcal{A}$ if there are no observations
that distinguish them; that is, there is no observation $U \in O$ such that $w_1 \in U$ and
$w_2 \notin U$. Define the $m \times n$ matrix $S$ with entries $s_{ij}$ as follows:

$$s_{ij} = \begin{cases} 1 & \text{if } A_i \subseteq U_j \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$

We call $S$ the CARacterizing matrix (for $O$ and $W$). Note that each row $i$ in $S$ corresponds
to a unique atom in $\mathcal{A}$; we call this the atom corresponding to row $i$. This matrix (actually,
its transpose) was first introduced (but for a different purpose) in [Gill, van der Laan,
and Robins 1997].
Example 4.2: Returning to Example 3.3, the CARacterizing matrix is given by

$$\begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 0 & 1 \end{pmatrix}$$

where the columns correspond to $U_1$ and $U_2$ and the rows correspond to the three atoms
$U_1 - U_2$, $U_1 \cap U_2$ and $U_2 - U_1$. For example, the fact that entry $s_{31}$ of this matrix is 0
indicates that $U_1$ cannot be observed if the actual world $w$ is in $U_2 - U_1$.
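
The atoms and the CARacterizing matrix can be computed by grouping worlds according to which observations contain them. The sketch below is an illustration under that reading of the definition; the function name and the concrete sets chosen for $U_1$ and $U_2$ are assumptions, not part of the paper.

```python
def caracterizing_matrix(worlds, observations):
    """Return (atoms, S) with S[i][j] = 1 if atoms[i] is a subset of U_j.

    Worlds are grouped by the set of observations that contain them; worlds
    lying in no observation occur in no run and are dropped.
    """
    by_signature = {}
    for w in sorted(worlds):
        signature = tuple(1 if w in u else 0 for u in observations)
        if any(signature):
            by_signature.setdefault(signature, set()).add(w)
    atoms = list(by_signature.values())
    matrix = [list(sig) for sig in by_signature]
    return atoms, matrix

# Hypothetical instance of Example 3.3: U_1 = {1, 2}, U_2 = {2, 3}.
atoms, S = caracterizing_matrix({1, 2, 3}, [{1, 2}, {2, 3}])
```

For this instance the rows of `S` correspond to the atoms $U_1 - U_2$, $U_1 \cap U_2$ and $U_2 - U_1$, in that order.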

In the following lemma, 7 T denotes the transpose of the (row) vector 7, and 1 denotes 
the row vector consisting of all Is. 

Lemma 4.3: Let $\mathcal{R}$ be the set of runs over observations $O$ and worlds $W$, and let $S$ be
the CARacterizing matrix for $O$ and $W$.

(a) Let $\Pr$ be any distribution over $\mathcal{R}$ and let $S'$ be the matrix obtained by deleting from
$S$ all rows corresponding to an atom $A$ with $\Pr(X_W \in A) = 0$. Define the vector
$\gamma = (\gamma_1, \ldots, \gamma_n)$ by setting $\gamma_j = \Pr(X_O = U_j \mid X_W \in U_j)$ if $\Pr(X_W \in U_j) > 0$, and
$\gamma_j = 0$ otherwise, for $j = 1, \ldots, n$. If $\Pr$ satisfies CAR, then $S' \cdot \gamma^T = \vec{1}^T$.

(b) Let $S'$ be a matrix consisting of a subset of the rows of $S$, and let $\mathcal{P}_{W,S'}$ be the set
of distributions over $W$ with support corresponding to $S'$; i.e.,

$$\mathcal{P}_{W,S'} = \{P_W \mid P_W(A) > 0 \text{ iff } A \text{ corresponds to a row in } S'\}.$$

If there exists a vector $\gamma \geq \vec{0}$ such that $S' \cdot \gamma^T = \vec{1}^T$, then, for all $P_W \in \mathcal{P}_{W,S'}$, there
exists a distribution $\Pr$ over $\mathcal{R}$ with $\Pr_W = P_W$ (i.e., the marginal of $\Pr$ on $W$ is
$P_W$) such that (a) $\Pr$ satisfies CAR and (b) $\Pr(X_O = U_j \mid X_W \in U_j) = \gamma_j$ for all $j$
with $\Pr(X_W \in U_j) > 0$.

Note that (b) is essentially a converse of (a). A natural question to ask is whether (b)
would still hold if we replaced "for all $P_W \in \mathcal{P}_{W,S'}$ there exists $\Pr$ satisfying CAR with
$\Pr_W = P_W$" by "for all distributions $P_O$ over $O$ there exists $\Pr$ satisfying CAR with
$\Pr_O = P_O$." The answer is no; see Example 4.6(b)(ii).

Lemma 4.3 says that a distribution $\Pr$ that satisfies CAR and at the same time has
$\Pr(X_W \in A) > 0$ for $m$ different atoms $A$ can exist if and only if a certain set of $m$ linear
equations in $n$ unknowns has a solution. In many situations of interest, $m > n$ (note that
$m$ may be as large as $2^n - 1$). Not surprisingly then, in such situations there often can be
no distribution $\Pr$ that satisfies CAR, as we show in the next subsection. On the other
hand, if the set of equations $S' \gamma^T = \vec{1}^T$ does have a solution in $\gamma$, then the set of all solutions
forms the intersection of an affine subspace (i.e. a hyperplane) of $\mathbb{R}^n$ and the positive
orthant $[0, \infty)^n$. These solutions are just the conditional probabilities $\Pr(X_O = U_j \mid X_W \in U_j)$
for all distributions for which CAR holds that have support corresponding to $S'$.
These conditional probabilities may then be extended to a distribution over $\mathcal{R}$ by setting
$\Pr_W = P_W$ for an arbitrary distribution $P_W$ over the worlds in atoms corresponding to
$S'$; all $\Pr$ constructed in this way satisfy CAR.

Summarizing, we have the remarkable fact that for any given set of atoms $\mathcal{A}$ there
are only two possibilities: either no distribution exists which has $\Pr(X_W \in A) > 0$ for
all $A \in \mathcal{A}$ and satisfies CAR, or for all distributions $P_W$ over worlds corresponding to
atoms in $\mathcal{A}$, there exists a distribution satisfying CAR with marginal distribution over
worlds equal to $P_W$.
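
To make the linear-algebraic condition concrete, here is a minimal Python sketch that builds a CARacterizing matrix and checks whether a candidate vector $\gamma$ satisfies the condition of Lemma 4.3. It assumes that atoms are the classes of worlds lying in exactly the same observations and that entry $s_{ij}$ is 1 exactly when atom $i$ is contained in $U_j$; the function names are illustrative, not part of the paper.

```python
from fractions import Fraction

def caracterizing_matrix(worlds, observations):
    """Return (atoms, rows): one 0-1 row per nonempty atom, one column per observation.

    Assumption: two worlds belong to the same atom iff they lie in exactly the
    same observations; worlds contained in no observation are ignored.
    """
    atoms = {}
    for w in worlds:
        row = tuple(1 if w in U else 0 for U in observations)
        if any(row):
            atoms.setdefault(row, set()).add(w)
    rows = sorted(atoms)
    return [atoms[r] for r in rows], rows

def satisfies_lemma_condition(rows, gamma):
    """Check gamma >= 0 and (row . gamma) = 1 for every row, i.e. S' gamma^T = 1^T."""
    nonnegative = all(g >= 0 for g in gamma)
    rows_sum_to_one = all(sum(s * g for s, g in zip(row, gamma)) == 1 for row in rows)
    return nonnegative and rows_sum_to_one

# Three worlds, each lying in exactly two of three observations.
atoms, S = caracterizing_matrix({1, 2, 3}, [{2, 3}, {1, 3}, {1, 2}])
half = Fraction(1, 2)
print(S, satisfies_lemma_condition(S, [half, half, half]))
```

The sketch only verifies a given $\gamma$; deciding whether any nonnegative solution exists is a linear feasibility problem and would need a linear-programming routine.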

4.3 When CAR is guaranteed not to hold 

We now present a theorem that gives two explicit and easy-to-check sufficient conditions 
under which CAR cannot hold unless the probabilities of some atoms and/or observations 
are 0. The theorem is proved by showing that the condition of Lemma 4.3(a) cannot 
hold under the stated conditions. 

We briefly recall some standard definitions from linear algebra. A set of vectors
$v_1, \ldots, v_m$ is called linearly dependent if there exist coefficients $\lambda_1, \ldots, \lambda_m$ (not all zero)
such that $\sum_{i=1}^m \lambda_i v_i = 0$; the vectors are affinely dependent if there exist coefficients
$\lambda_1, \ldots, \lambda_m$ (not all zero) such that $\sum_{i=1}^m \lambda_i v_i = 0$ and $\sum_{i=1}^m \lambda_i = 0$. A vector $u$ is called an
affine combination of $v_1, \ldots, v_m$ if there exist coefficients $\lambda_1, \ldots, \lambda_m$ such that
$\sum_{i=1}^m \lambda_i v_i = u$ and $\sum_{i=1}^m \lambda_i = 0$.

Theorem 4.4: Let $\mathcal{R}$ be a set of runs over observations $O = \{U_1, \ldots, U_n\}$ and worlds
$W$, and let $S$ be the CARacterizing matrix for $O$ and $W$.

(a) Suppose that there exists a subset $R$ of the rows in $S$ and a vector $u = (u_1, \ldots, u_n)$
that is an affine combination of the rows of $R$ such that $u_j \geq 0$ for all $j \in \{1, \ldots, n\}$
and $u_{j^*} > 0$ for some $j^* \in \{1, \ldots, n\}$. Then there is no distribution $\Pr$ on $\mathcal{R}$ that
satisfies CAR such that $\Pr(X_O = U_{j^*}) > 0$ and $\Pr(X_W \in A) > 0$ for each $\mathcal{R}$-atom
$A$ corresponding to a row in $R$.

(b) If there exists a subset $R$ of the rows of $S$ that is linearly dependent but not affinely
dependent, then there is no distribution $\Pr$ on $\mathcal{R}$ that satisfies CAR such that
$\Pr(X_W \in A) > 0$ for each $\mathcal{R}$-atom $A$ corresponding to a row in $R$.

(c) Given a set $R$ consisting of $n$ linearly independent rows of $S$ and a distribution
$P_W$ on $W$ such that $P_W(A) > 0$ for all $A$ corresponding to a row in $R$, there is a
unique distribution $P_O$ on $O$ such that if $\Pr$ is a distribution on $\mathcal{R}$ satisfying CAR
and $\Pr(X_W \in A) = P_W(A)$ for each atom $A$ corresponding to a row in $R$, then
$\Pr(X_O = U) = P_O(U)$.

It is well known that in an m x n matrix, at most n rows can be linearly independent. 
In many cases of interest (cf. Example 4.5 below), the number of atoms m is larger than
the number of observations n, so that there must exist subsets R of rows of S that are
linearly dependent. Thus, part (b) of Theorem 4.4 puts nontrivial constraints on the 
distributions that satisfy CAR. 

The requirement in part (a) may seem somewhat obscure but it can be easily checked
and applied in a number of situations, as illustrated in Examples 4.5 and 4.6 below. Part
(c) says that in many other cases of interest where neither part (a) nor (b) applies, even
if a distribution on $\mathcal{R}$ exists satisfying CAR, the probabilities of making the observations
are completely determined by the probability of various events in the world occurring,
which seems rather unreasonable.
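
The following worked step sketches why part (a) follows from Lemma 4.3(a). It is a reconstruction of the argument indicated above, not a quotation of the proof. Write $r_1, \ldots, r_k$ for the rows in $R$, let $u = \sum_i \lambda_i r_i$ with $\sum_i \lambda_i = 0$, and suppose $\Pr$ satisfies CAR and gives every atom of $R$ positive probability, so that $r_i \cdot \gamma^T = 1$ for each $i$ with $\gamma \geq \vec{0}$ as in the lemma.

```latex
\begin{align*}
u \cdot \gamma^T
  &= \Bigl(\sum_{i=1}^{k} \lambda_i r_i\Bigr) \cdot \gamma^T
   = \sum_{i=1}^{k} \lambda_i \,(r_i \cdot \gamma^T)
   = \sum_{i=1}^{k} \lambda_i = 0, \\
u \cdot \gamma^T
  &= \sum_{j=1}^{n} u_j \gamma_j
     \quad\text{with } u_j \ge 0,\ \gamma_j \ge 0
   \;\Longrightarrow\; u_j \gamma_j = 0 \text{ for all } j
   \;\Longrightarrow\; \gamma_{j^*} = 0 .
\end{align*}
```

Since $\gamma_{j^*} = 0$ means that either $\Pr(X_W \in U_{j^*}) = 0$ or $\Pr(X_O = U_{j^*} \mid X_W \in U_{j^*}) = 0$, and $U_{j^*}$ can be observed only when the actual world is in $U_{j^*}$, we get $\Pr(X_O = U_{j^*}) = 0$. For part (b), the same computation applied to a linear dependence $\sum_i \lambda_i r_i = 0$ with $\sum_i \lambda_i \neq 0$ gives $0 = \sum_i \lambda_i$, a contradiction.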

Example 4.5: Consider the CARacterizing matrix of Example 4.2. Notice there exists
an affine combination of the first two rows that is not 0 and has no negative components:

$$-1 \cdot (1\ 0) + 1 \cdot (1\ 1) = (0\ 1).$$

Similarly, there exists an affine combination of the last two rows that is not 0 and has
no negative components. It follows from Theorem 4.4(a) that there is no distribution
satisfying CAR that gives both of the observations $X_O = U_1$ and $X_O = U_2$ positive
probability and either (a) gives both $X_W \in U_1 - U_2$ and $X_W \in U_1 \cap U_2$ positive probability
or (b) gives both $X_W \in U_2 - U_1$ and $X_W \in U_1 \cap U_2$ positive probability. If both
observations have positive probability, then CAR can hold only if the probability of
$U_1 \cap U_2$ is either 0 or 1. (Example 3.3 already shows this using a more direct argument.) ∎

The next example further illustrates that in general, it can be very difficult to satisfy 
CAR. 

Example 4.6: Suppose that $O = \{U_1, U_2, U_3\}$, and all three observations can be made
with positive probability. It turns out that in this situation the CAR condition can hold,
but only if (a) $\Pr(X_W \in U_1 \cap U_2 \cap U_3) = 1$ (i.e., all of $U_1$, $U_2$, and $U_3$ must hold), (b)
$\Pr(X_W \in ((U_1 \cap U_2) - U_3) \cup ((U_2 \cap U_3) - U_1) \cup ((U_1 \cap U_3) - U_2)) = 1$ (i.e., exactly two of $U_1$,
$U_2$, and $U_3$ must hold), (c) $\Pr(X_W \in (U_1 - (U_2 \cup U_3)) \cup (U_2 - (U_1 \cup U_3)) \cup (U_3 - (U_1 \cup U_2))) = 1$
(i.e., exactly one of $U_1$, $U_2$, or $U_3$ must hold), or (d) one of $(U_1 - (U_2 \cup U_3)) \cup (U_2 \cap U_3)$,
$(U_2 - (U_1 \cup U_3)) \cup (U_1 \cap U_3)$ or $(U_3 - (U_1 \cup U_2)) \cup (U_1 \cap U_2)$ has probability 1 (either
exactly one of $U_1$, $U_2$, or $U_3$ holds, or the remaining two both hold).

We first check that CAR can hold in all these cases. It should be clear that CAR
can hold in case (a). Moreover, there are no constraints on $\Pr(X_O = U_i \mid X_W = w)$ for
$w \in U_1 \cap U_2 \cap U_3$ (except, by the CAR condition, for each fixed $i$, the probability must
be the same for all $w \in U_1 \cap U_2 \cap U_3$, and the three probabilities must sum to 1).

For case (b), let $A_i$ be the atom where exactly two of $U_1$, $U_2$, and $U_3$ hold, and $U_i$
does not hold, for $i = 1, 2, 3$. Suppose that $\Pr(X_W \in A_1 \cup A_2 \cup A_3) = 1$. Note that, since
all three observations can be made with positive probability, at least two of $A_1$, $A_2$, and
$A_3$ must have positive probability. Hence we can distinguish between two subcases: (i)
only two of them have positive probability, and (ii) all three have positive probability.

For subcase (i), suppose without loss of generality that only $A_1$ and $A_2$ have positive
probability. Then it immediately follows from the CAR condition that there must be
some $\alpha$ with $0 < \alpha < 1$ such that $\Pr(X_O = U_3 \mid X_W = w) = \alpha$, for all $w \in A_1 \cup A_2$
such that $\Pr(X_W = w) > 0$. Thus, $\Pr(X_O = U_1 \mid X_W = w) = 1 - \alpha$ for all $w \in A_2$ such
that $\Pr(X_W = w) > 0$, and $\Pr(X_O = U_2 \mid X_W = w) = 1 - \alpha$ for all $w \in A_1$ such that
$\Pr(X_W = w) > 0$.

Subcase (ii) is more interesting. The rows of the CARacterizing matrix $S$ corresponding
to $A_1$, $A_2$, and $A_3$ are $(0\ 1\ 1)$, $(1\ 0\ 1)$, and $(1\ 1\ 0)$, respectively. Now Lemma 4.3(a)
tells us that if $\Pr$ satisfies CAR, then we must have $S \cdot \gamma^T = \vec{1}^T$ for some $\gamma = (\gamma_1, \gamma_2, \gamma_3)$
with $\gamma_i = \Pr(X_O = U_i \mid X_W \in U_i)$. These three linear equations have solution

$$\gamma_1 = \gamma_2 = \gamma_3 = \frac{1}{2}.$$

Since this solution is unique, it follows by Lemma 4.3(b) that all distributions that
satisfy CAR must have conditional probabilities $\Pr(X_O = U_i \mid X_W \in U_i) = 1/2$, and that
their marginal distributions on $W$ can be arbitrary. This fully characterizes the set of
distributions $\Pr$ for which CAR holds in this case. Note that for $i = 1, 2, 3$, since we can
write $\gamma_i = \Pr(X_O = U_i)/\Pr(X_W \in U_i)$, we have $\Pr(X_O = U_i) = \Pr(X_W \in U_i)\,\gamma_i \leq 1/2$ so
that, in contrast to the marginal distribution over $W$, the marginal distribution over $O$
cannot be chosen arbitrarily.
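
The uniqueness claim in subcase (ii) can be checked mechanically. The sketch below solves $S \cdot \gamma^T = \vec{1}^T$ exactly over the rationals using plain Gauss-Jordan elimination; `solve_exact` is a hypothetical helper written for this illustration and only handles square, nonsingular systems.

```python
from fractions import Fraction

def solve_exact(A, b):
    """Solve A x = b for a square nonsingular A by Gauss-Jordan elimination."""
    n = len(A)
    # Augmented matrix with exact rational entries.
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular; the solution is not unique")
        M[col], M[pivot] = M[pivot], M[col]
        lead = M[col][col]
        M[col] = [v / lead for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [row[n] for row in M]

# Rows for the atoms A_1, A_2, A_3 of subcase (ii).
S = [(0, 1, 1), (1, 0, 1), (1, 1, 0)]
gamma = solve_exact(S, [1, 1, 1])
print(gamma)
```

Because the elimination succeeds without hitting a zero pivot column, the matrix is nonsingular, which is what makes the solution unique.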

In case (c), it should also be clear that CAR can hold. Moreover, $\Pr(X_O = U_i \mid X_W = w)$
is either 0 or 1, depending on whether $w \in U_i$. Finally, for case (d), suppose that
$\Pr(X_W \in U_1 \cup (U_2 \cap U_3)) = 1$. CAR holds iff there exists $\alpha$ such that $\Pr(X_O = U_2 \mid X_W = w) = \alpha$
and $\Pr(X_O = U_3 \mid X_W = w) = 1 - \alpha$ for all $w \in U_2 \cap U_3$ such that $\Pr(X_W = w) > 0$.
(Of course, $\Pr(X_O = U_1 \mid X_W = w) = 1$ for all $w \in U_1$ such that $\Pr(X_W = w) > 0$.)

Now we show that CAR cannot hold in any other case. First suppose that
$0 < \Pr(X_W \in U_1 \cap U_2 \cap U_3) < 1$. Thus, there must be at least one other atom $A$ such
that $\Pr(X_W \in A) > 0$. The row corresponding to the atom $U_1 \cap U_2 \cap U_3$ is $(1\ 1\ 1)$.
Suppose $r$ is the row corresponding to the other atom $A$. Since $S$ is a 0-1 matrix, the
vector $(1\ 1\ 1) - r$ is an affine combination of $(1\ 1\ 1)$ and $r$ that is nonzero and has
nonnegative components. It now follows by Theorem 4.4(a) that CAR cannot hold in this
case.

Similar arguments give a contradiction in all the other cases; we leave details to the
reader. ∎

4.4 Discussion: "CAR is everything" vs. "sometimes CAR is nothing"

In one of their main theorems, Gill, van der Laan, and Robins [1997, Section 2] show that
the CAR assumption is untestable from observations of $X_O$ alone, in the sense that the
assumption "$\Pr$ satisfies CAR" imposes no restrictions at all on the marginal distribution
$\Pr_O$ on $X_O$. More precisely, they show that for every finite set $W$ of worlds, every set $O$
of observations, and every distribution $P_O$ on $O$, there is a distribution $\Pr^*$ on $\mathcal{R}$ such
that $\Pr^*_O$ (the marginal of $\Pr^*$ on $O$) is equal to $P_O$ and $\Pr^*$ satisfies CAR. The authors
summarize this as "CAR is everything".

We must be careful in interpreting this result. Theorem 4.4 shows that, for many
combinations of $O$ and $W$, CAR can hold only for distributions $\Pr$ with $\Pr(X_W \in A) = 0$
for some atoms $A$. (In the previous sections, we called such distributions "degenerate".)
In our view, this says that in some cases, CAR effectively cannot hold. To see why,
first suppose we are given a set $W$ of worlds and a set $O$ of observations. Now we
may feel confident a priori that some $U_0 \in O$ and some $w_0 \in W$ cannot occur in
practice. In this case, we are willing to consider only distributions $\Pr$ on $O \times W$ that
have $\Pr(X_O = U_0) = 0$, $\Pr(X_W = w_0) = 0$. (For example, $W$ may be a product space
$W = W_a \times W_b$, and it is known that some combination $w_a \in W_a$ and $w_b \in W_b$ can never
occur together; then $\Pr(X_W = (w_a, w_b)) = 0$.) Define $O^*$ to be the subset of $O$ consisting
of all $U$ that we cannot a priori rule out; similarly, $W^*$ is the subset of $W$ consisting of
all $w$ that we cannot a priori rule out. By Theorem 4.4, it is still possible that $O^*$ and
$W^*$ are such that, even if we restrict to runs where only observations in $O^*$ are made,
CAR can only hold if $\Pr(X_W \in A) = 0$ for some atoms (nonempty subsets) $A \subseteq W^*$.
This means that CAR may force us to assign probability 0 to some events that, a priori,
were considered possible. Examples 3.3 and 4.6 illustrate this phenomenon. We may
summarize this as "sometimes CAR is nothing".

Given that CAR imposes such strong conditions, the reader may wonder
why there is so much study of the CAR condition in the statistics literature. The reason
is that some of the special situations in which CAR holds often arise in missing data and
survival analysis problems. Here is an example: Suppose that the set of observations can
be written as $O = \bigcup_{i=1}^{k} \Pi_i$, where each $\Pi_i$ is a partition of $W$ (that is, a set of pairwise
disjoint subsets of $W$ whose union is $W$). Further suppose that observations are generated
by the following process, which we call CARgen. Some $i$ between 1 and $k$ is chosen
according to some arbitrary distribution $P_0$; independently, $w \in W$ is chosen according
to $P_W$. The agent then observes the unique $U \in \Pi_i$ such that $w \in U$. Intuitively, the
partitions $\Pi_i$ may represent the observations that can be made with a particular sensor.
Thus, $P_0$ determines the probability that a particular sensor is chosen; $P_W$ determines
the probability that a particular world is chosen. The sensor and the world together
determine the observation that is made. It is easy to see that this mechanism induces a
distribution on $\mathcal{R}$ for which CAR holds.
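
As an illustration, the CARgen process can be written down in a few lines and the CAR condition checked on the distribution it induces. This is a minimal sketch under stated assumptions: worlds and sensors are finite, the CAR condition is taken in the form "for each $U$, $\Pr(X_O = U \mid X_W = w)$ is the same for all $w \in U$ with positive probability", and the function names and example numbers are invented for the illustration.

```python
from fractions import Fraction

def cargen_joint(p_world, partitions, p_sensor):
    """Exact joint Pr(X_W = w, X_O = U) induced by CARgen.

    p_world: dict world -> probability; partitions: list of partitions of W,
    each a list of frozensets; p_sensor: probability of choosing each partition.
    """
    joint = {}
    for w, pw in p_world.items():
        for part, ps in zip(partitions, p_sensor):
            U = next(block for block in part if w in block)  # the unique block containing w
            joint[(w, U)] = joint.get((w, U), 0) + pw * ps
    return joint

def satisfies_car(joint, p_world):
    """Check that Pr(X_O = U | X_W = w) does not depend on w, for w in U with Pr(w) > 0."""
    for U in {obs for (_, obs) in joint}:
        conds = {joint.get((w, U), 0) / pw for w, pw in p_world.items() if w in U and pw > 0}
        if len(conds) > 1:
            return False
    return True

F = Fraction
W = frozenset({"a", "b", "c"})
partitions = [[W], [frozenset({w}) for w in W]]   # "no reading" sensor and "exact" sensor
p_sensor = [F(1, 4), F(3, 4)]
p_world = {"a": F(1, 2), "b": F(1, 3), "c": F(1, 6)}
joint = cargen_joint(p_world, partitions, p_sensor)
print(satisfies_car(joint, p_world))
```

The key design point is that the sensor index is drawn independently of the world, which is why the conditional probability of each observation is constant over the worlds it contains.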

The special case with $O = \Pi_1 \cup \Pi_2$, $\Pi_1 = \{W\}$, and $\Pi_2 = \{\{w\} \mid w \in W\}$ corresponds
to a simple missing data problem (Example 4.7 below). Intuitively, either complete
information is given, or there is no data at all. In this context, CAR is often called MAR:
missing at random. In more realistic MAR problems, we may observe a vector with some
of its components missing. In such cases the CAR condition sometimes still holds. In
practical missing data problems, the goal is often to infer the distribution $\Pr$ on runs $\mathcal{R}$
from successive observations of $X_O$. That is, one observes a sample $U_{(1)}, U_{(2)}, \ldots, U_{(n)}$,
where $U_{(i)} \in O$. Typically, the $U_{(i)}$ are assumed to be an i.i.d. (independent and identically
distributed) sample of outcomes of $X_O$. The corresponding "worlds" $w_1, w_2, \ldots$ (outcomes
of $X_W$) are not observed. Depending on the situation, $\Pr$ may be completely unknown or
is assumed to be a member of some parametric family of distributions. If the number of
observations $n$ is large, then clearly the sample $U_{(1)}, U_{(2)}, \ldots, U_{(n)}$ can be used to obtain
a reasonable estimate of $\Pr_O$, the marginal distribution on $X_O$. But one is interested
in the full distribution $\Pr$. That distribution usually cannot be inferred without making
additional assumptions, such as the CAR assumption.

Example 4.7: (adapted from [Scharfstein, Daniels, and Robins 2002]) Suppose that a
medical study is conducted to test the effect of a new drug. The drug is administered
to a group of patients on a weekly basis. Before the experiment is started and after it is
finished, some characteristic (say, the blood pressure) of the patients is measured. The
data are thus differences in blood pressure for individual patients before and after the
treatment. In practical studies of this kind, often several of the patients drop out of the
experiment. For such patients there is then no data. We model this as follows: $W$ is
the set of possible values of the characteristic we are interested in (e.g., blood pressure
difference). $O = \Pi_1 \cup \Pi_2$ with $\Pi_1 = \{W\}$ and $\Pi_2 = \{\{w\} \mid w \in W\}$ as above. For
"compliers" (patients that did not drop out), we observe $X_O = \{w\}$, where $w$ is the
value of the characteristic we want to measure. For dropouts, we observe $X_O = W$ (that
is, we observe nothing at all). We thus have, for example, a sequence of observations
$U_1 = \{w_1\}, U_2 = \{w_2\}, U_3 = W, U_4 = \{w_4\}, U_5 = W, \ldots, U_n = \{w_n\}$. If this sample
is large enough, we can use it to obtain a reasonable estimate of the probability that a
patient drops out (the ratio of outcomes with $U_i = W$ to the total number of outcomes).
We can also get a reasonable estimate of the distribution of $X_W$ for the complying
patients. Together these two distributions determine the distribution of $X_O$.

We are interested in the effect of the drug in the general population. Unfortunately,
it may be the case that the effect on dropouts is different from the effect on compliers.
(Scharfstein, Daniels, and Robins [2002] discuss an actual medical study in which physicians
judged the effect on dropouts to be very different from the effect on compliers.)
Then we cannot infer the distribution on $W$ from the observations $U_1, U_2, \ldots$ alone without
making additional assumptions about how the distribution for dropouts is related to
the distribution for compliers. Perhaps the simplest such assumption that one can make
is that the distribution of $X_W$ for dropouts is in fact the same as the distribution of $X_W$
for compliers: the data are "missing at random". Of course, this assumption is just the
CAR assumption. By Theorem 3.1(a), CAR holds iff for all $w \in W$

$$\Pr(X_W = w \mid X_O = W) = \Pr(X_W = w \mid X_W \in W) = \Pr(X_W = w),$$

which means just that the distribution of $X_W$ is independent of whether a patient drops
out ($X_O = W$) or not. Thus, if CAR can be assumed, then we can infer the distribution
on $W$ (which is what we are really interested in). ∎
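
The two estimates described in Example 4.7 can be sketched as follows. The code is illustrative only: it assumes the sample is given as a list in which `None` marks a dropout (the observation $W$), and the sample values are made up. The second return value is an estimate of the distribution on $W$ only if CAR (MAR) is assumed; without that assumption it describes compliers alone.

```python
from collections import Counter
from fractions import Fraction

def estimate_under_mar(sample):
    """Return (dropout rate, empirical distribution of the observed values).

    sample: list of observed values, with None standing for the observation W
    (no data). Assumes at least one non-missing value.
    """
    n = len(sample)
    observed = [x for x in sample if x is not None]
    dropout_rate = Fraction(n - len(observed), n)
    counts = Counter(observed)
    # Under the CAR/MAR assumption this is also the estimate for dropouts.
    dist = {w: Fraction(c, len(observed)) for w, c in counts.items()}
    return dropout_rate, dist

# Hypothetical blood-pressure differences; None marks a patient who dropped out.
sample = [-5, None, 0, -5, None, 5, -5, 0]
rate, dist = estimate_under_mar(sample)
print(rate, dist)
```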

Many problems in missing data and survival analysis are of the kind illustrated above:
The analysis would be greatly simplified if CAR holds, but whether or not this is so
is not clear. It is therefore of obvious interest to investigate whether, from observing
the "coarsened" data $U_{(1)}, U_{(2)}, \ldots, U_{(n)}$ alone, it may already be possible to test the
assumption that CAR holds. For example, one might imagine that there are distributions
on $X_O$ for which CAR simply cannot hold. If the empirical distribution of the $U_{(i)}$ were
"close" (in the appropriate sense) to a distribution that rules out CAR, the statistician
might infer that $\Pr$ does not satisfy CAR. Unfortunately, if $O$ is finite, then the result
of Gill, van der Laan, and Robins [1997, Section 2] referred to at the beginning of this
section shows that we can never rule out CAR in this way.

We are interested in the question of whether CAR can hold in a "nondegenerate"
sense, given $O$ and $W$. From this point of view, the slogan "sometimes CAR is nothing"
makes sense. In contrast, Gill, van der Laan, and Robins [1997] were interested in the
question of whether CAR can be tested from observations of $X_O$ alone. From that point
of view, the slogan "CAR is everything" makes perfect sense. In fact, Gill, van der
Laan, and Robins were quite aware, and explicitly stated, that CAR imposes very strong
assumptions on the distribution $\Pr$. In a later paper, it was even implicitly stated that
in some cases CAR forces $\Pr(X_W \in A) = 0$ for some atoms $A$ [Robins, Rotnitzky, and
Scharfstein 1999, Section 9.1]. Our contribution is to provide the precise conditions
(Lemma 4.3 and Theorem 4.4) under which this happens.

Robins, Rotnitzky, and Scharfstein [1999] also introduced a Bayesian method (later
extended in [Scharfstein, Daniels, and Robins 2002]) that allows one to specify a prior
distribution over a parameter $\alpha$ which indicates, in a precise sense, how much $\Pr$ deviates
from CAR. For example, $\alpha = 0$ corresponds to the set of distributions $\Pr$ satisfying CAR.
The precise connection between this work and ours needs further investigation.

4.5 A mechanism for generating distributions satisfying CAR 

In Theorem 3.1 and Lemma 4.3 we described CAR in an algebraic way, as a collection 
of probabilities satisfying certain equalities. Is there a more "procedural" way of repre- 
senting CAR? In particular, does there exist a single mechanism that gives rise to CAR 
such that every case of CAR can be viewed as a special case of this mechanism? 

Before we can answer this question, we have to make clear what counts as a mechanism.
Without any constraints, there is clearly a trivial solution to the problem, as
already noted by Gill, van der Laan, and Robins [1997]: Given a distribution $\Pr$ satisfying
CAR, we simply draw a world $w$ according to $\Pr_W$, and then draw $U$ such that $w \in U$
according to the distribution $\Pr(X_O = U \mid X_W = w)$. This is obviously cheating in some
sense. Intuitively, the problem here is that we cannot "choose" $U$ according to a certain
distribution. We do not have that kind of control over the observations that are made.

So what can we do? Intuitively, the mechanism should be able to control only what
can be controlled in an experimental setup. While it is fair to assume that we are given
some sensors, it is not fair to assume that we can control their output (or exactly what
they can sense). Assume that we are given a world $w \in W$, generated according to some
distribution $P_W$. Intuitively, we do not have control over $P_W$. Given $P_W$, our goal is to
find a procedure that generates all and only the distributions $\Pr$ satisfying CAR such
that $\Pr_W = P_W$. One approach is to assume that the agent gets to make observations,
using possibly different sensors. While the agent can choose which sensor to consult,
it cannot choose what the sensor observes. Indeed, given a world $w$, the observation
returned by the sensor is determined. This is exactly what is done in the CARgen
scheme discussed in Section 4.4.

Gill, van der Laan, and Robins [1997] consider another approach. They show that in 
several problems of survival analysis, observations are generated according to what they 
call a randomized monotone coarsening scheme. They also show that their randomized 
scheme generates only distributions that satisfy CAR. In fact, the randomized monotone 
coarsening scheme turns out to be a special case of CARgen, although we do not prove 
this here. Gill, van der Laan, and Robins show by example that the randomized coars- 
ening schemes do not suffice to generate all CAR distributions. We now use essentially 
the same example to show that CARgen does not either. 

Example 4.8: Consider subcase (ii) of Example 4.6 again. Let $U_1$, $U_2$, $U_3$ and $A_1$, $A_2$ and
$A_3$ be as in that example, and assume for simplicity that $W = A_1 \cup A_2 \cup A_3$. The example
showed that there exist distributions $\Pr$ satisfying CAR in this case with $\Pr(A_i) > 0$
for $i \in \{1, 2, 3\}$, all having conditional probabilities $\Pr(X_O = U_i \mid X_W = w) = 1/2$ for all
$w \in U_i$. Clearly, $U_1$, $U_2$ and $U_3$ cannot be grouped together to form a set of partitions of
$W$. So, even though CAR holds for $\Pr$, CARgen cannot be used to simulate $\Pr$. ∎

The problem of finding a natural mechanism that generates all and only distributions
that satisfy CAR seems to be one of the goals of Gill, van der Laan, and Robins' work
(see, in particular, [1997, Section 3]), although they do not formulate the problem
precisely. While we also do not give a precise formulation of what counts as a reasonable
mechanism (although it can be done in the runs framework; essentially, each step of
the algorithm can depend only on information available to the experimenter, where the
"information" is encoded in the observations made by the experimenter in the course of
running the algorithm), we do give an argument that the mechanism we propose is in
fact reasonable. We call the procedure CARgen*, since it extends CARgen. Just like
CARgen, CARgen* assumes that there is a collection of sensors, and it consults a given
sensor with a certain predetermined probability. However, unlike CARgen, CARgen*
may ignore a sensor reading.

Procedure CARgen*

1. Preparation:

• Fix an arbitrary distribution $P_W$ on $W$.

• Fix a set $\mathcal{P}$ of partitions of $W$, and fix an arbitrary distribution $P_{\mathcal{P}}$ on $\mathcal{P}$.

• Choose numbers $q \in [0, 1)$ and $q_{U|\Pi} \in [0, 1]$ for each pair $(U, \Pi)$ such that
$\Pi \in \mathcal{P}$ and $U \in \Pi$ satisfying the following constraint, for each $w \in W$ such
that $P_W(w) > 0$:

$$q = \sum_{\{(U,\Pi) :\, w \in U,\ U \in \Pi\}} P_{\mathcal{P}}(\Pi)\, q_{U|\Pi}. \qquad (4)$$

2. Generation:

2.1 Choose $w \in W$ according to $P_W$.

2.2 Choose $\Pi \in \mathcal{P}$ according to $P_{\mathcal{P}}$. Let $U$ be the unique set in $\Pi$ such that $w \in U$.

2.3 With probability $1 - q_{U|\Pi}$, return $(w, U)$ and halt. With probability $q_{U|\Pi}$, go
to step 2.2.
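
The generation phase translates almost line by line into a sampler. The following is a minimal sketch, not a reference implementation: the data layout (partitions as lists of frozensets, rejection probabilities keyed by a set and the index of its partition) and all parameter values are assumptions made for the illustration, and the parameters are assumed to satisfy constraint (4) with $q < 1$ so that the loop terminates with probability 1.

```python
import random

def cargen_star_sample(p_world, partitions, p_part, q, rng):
    """One run of the generation phase of CARgen*; returns a pair (w, U)."""
    worlds = list(p_world)
    # Step 2.1: nature draws the world w according to P_W.
    w = rng.choices(worlds, weights=[p_world[x] for x in worlds])[0]
    while True:
        # Step 2.2: pick a partition (sensor) and read off the block containing w.
        i = rng.choices(range(len(partitions)), weights=p_part)[0]
        U = next(block for block in partitions[i] if w in block)
        # Step 2.3: accept with probability 1 - q[(U, i)]; otherwise repeat step 2.2.
        if rng.random() >= q[(U, i)]:
            return w, U

# Hypothetical parameters: three worlds, partitions {U_i, A_i} with A_i = {i}.
W = [1, 2, 3]
A = {i: frozenset({i}) for i in W}
U = {i: frozenset(W) - A[i] for i in W}
partitions = [[U[i], A[i]] for i in W]
p_part = [1 / 3, 1 / 3, 1 / 3]
q = {}
for idx, i in enumerate(W):
    q[(U[i], idx)] = 0.0   # always accept a reading U_i
    q[(A[i], idx)] = 1.0   # always ignore a reading A_i
p_world = {1: 0.5, 2: 0.3, 3: 0.2}

rng = random.Random(0)
samples = [cargen_star_sample(p_world, partitions, p_part, q, rng) for _ in range(1000)]
```

With these parameters every world is rejected with probability 1/3 per round, so the rejection probability is the same for all worlds, as (4) requires.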

It is easy to see that CARgen is the special case of CARgen* where $q_{U|\Pi} = 0$ for all
$(U, \Pi)$. Allowing $q_{U|\Pi} > 0$ gives us a little more flexibility. To understand the role of the
constraint (4), note that $q_{U|\Pi}$ is the probability that the algorithm does not terminate at
step 2.3, given that $U$ and $\Pi$ are chosen at step 2.2. It follows that the probability $q_w$
that a pair $(w, U)$ is not output at step 2.3 for some $U$ is

$$q_w = \sum_{\{(U,\Pi) :\, w \in U,\ U \in \Pi\}} P_{\mathcal{P}}(\Pi)\, q_{U|\Pi}.$$

Thus, (4) says that the probability $q_w$ that a pair whose first component is $w$ is not
output at step 2.3 is the same for all $w \in W$.

CARgen* can generate the CAR distribution in Example 4.8, which could not be
generated by CARgen. To see this, using the same notation as in the example, consider
the set of partitions $\mathcal{P} = \{\Pi_1, \Pi_2, \Pi_3\}$ with $\Pi_i = \{U_i, A_i\}$. Let $P_{\mathcal{P}}(\Pi_1) = P_{\mathcal{P}}(\Pi_2) =
P_{\mathcal{P}}(\Pi_3) = 1/3$, $q_{U_i|\Pi_i} = 0$, and $q_{A_i|\Pi_i} = 1$. It is easy to verify that for all $w \in W$, we have
that $\sum_{\{(U,\Pi) :\, w \in U,\ U \in \Pi\}} P_{\mathcal{P}}(\Pi)\, q_{U|\Pi} = 1/3$, so that the constraint (4) is satisfied. Moreover,
direct calculation shows that, for arbitrary $P_W$, the distribution $\Pr^*$ on runs generated by
CARgen* with this choice of parameters is precisely the unique distribution satisfying
CAR in this case.
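
The "direct calculation" can be reproduced exactly. The sketch below rests on one derivation that is ours rather than the text's: since the rounds of steps 2.2 and 2.3 are independent, the probability that CARgen* eventually returns $U$ given $w$ is the one-round acceptance probability of $U$ divided by $1 - q_w$ (a geometric series). Names and data layout are assumptions for the illustration.

```python
from fractions import Fraction

def cargen_star_conditionals(worlds, partitions, p_part, q):
    """Pr(CARgen* returns U | world w) for each w, in closed form."""
    result = {}
    for w in worlds:
        accept = {}
        for i, part in enumerate(partitions):
            U = next(block for block in part if w in block)
            # One-round probability of choosing partition i and accepting its reading.
            accept[U] = accept.get(U, 0) + p_part[i] * (1 - q[(U, i)])
        q_w = 1 - sum(accept.values())  # one-round rejection probability for w
        result[w] = {U: a / (1 - q_w) for U, a in accept.items() if a > 0}
    return result

# Parameters of the example: W = A_1 u A_2 u A_3 with A_i = {i}, U_i = W - A_i.
W = [1, 2, 3]
A = {i: frozenset({i}) for i in W}
U = {i: frozenset(W) - A[i] for i in W}
partitions = [[U[i], A[i]] for i in W]
p_part = [Fraction(1, 3)] * 3
q = {}
for idx, i in enumerate(W):
    q[(U[i], idx)] = 0
    q[(A[i], idx)] = 1

cond = cargen_star_conditionals(W, partitions, p_part, q)
print(cond[1])
```

Each world is mapped to the two sets $U_j$ that contain it, each with probability 1/2, matching the conditional probabilities found in subcase (ii) of Example 4.6.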

So why is CARgen* a legitimate mechanism? The key point is that all the relevant
steps in the algorithm can be carried out by an experimenter. The parameters $q$ and
$q_{U|\Pi}$ for $\Pi \in \mathcal{P}$ and $U \in \Pi$ are chosen before the algorithm begins; this can certainly
be done by an experimenter. Similarly, it is straightforward to check that the equation
(4) holds for each $w \in W$. As for the algorithm itself, the experimenter has no control
over the choice of $w$; this is chosen by nature according to its distribution, $P_W$. However,
the experimenter can perform steps 2.2 and 2.3, that is, choosing $\Pi \in \mathcal{P}$ according to the
probability distribution $P_{\mathcal{P}}$, and rejecting the observation $U$ with probability $q_{U|\Pi}$ (since
the experimenter knows both the sensor chosen (i.e., $\Pi$) and the observation ($U$)).

The following theorem shows that CARgen* does exactly what we want. 

Theorem 4.9: Given a set $\mathcal{R}$ of runs over a set $W$ of worlds and a set $O$ of observations,
$\Pr$ is a distribution on $\mathcal{R}$ that satisfies CAR if and only if there is a setting of the
parameters in (step 1 of) CARgen* such that, for all $w \in W$ and $U \in O$,
$\Pr(\{r : X_W(r) = w, X_O(r) = U\})$ is the probability that CARgen* returns $(w, U)$.

5 Beyond Observations of Events 
5.1 Jeffrey Conditioning 

In the previous section, we assumed that the information received is of the form "the
actual world is in $U$". But information does not always come in such nice packages.
Perhaps the simplest generalization of this is to assume that there is a partition $\{U_1, \ldots, U_n\}$
of $W$ and the agent observes $\alpha_1 U_1; \ldots; \alpha_n U_n$, where $\alpha_1 + \cdots + \alpha_n = 1$. This is to be
interpreted as an observation that leads the agent to believe $U_j$ with probability $\alpha_j$, for
$j = 1, \ldots, n$. According to Jeffrey conditioning, given a distribution $P_W$ on $W$,

$$P_W(V \mid \alpha_1 U_1; \ldots; \alpha_n U_n) = \alpha_1 P_W(V \mid U_1) + \cdots + \alpha_n P_W(V \mid U_n).$$

Jeffrey conditioning is defined only if $\alpha_i > 0$ implies that $P_W(U_i) > 0$; if $\alpha_i = 0$ and
$P_W(U_i) = 0$, then $\alpha_i P_W(V \mid U_i)$ is taken to be 0. Clearly ordinary conditioning is the
special case of Jeffrey conditioning where $\alpha_i = 1$ for some $i$ so, as is standard, we
deliberately use the same notation for updating using Jeffrey conditioning and ordinary
conditioning.
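
Jeffrey conditioning on a finite space is short enough to spell out. The following sketch applies the definition above pointwise (taking $V = \{w\}$); the function name, the dict-based representation of $P_W$, and the example numbers are assumptions for the illustration.

```python
from fractions import Fraction

def jeffrey_update(prior, partition, alphas):
    """Return P_W(. | a_1 U_1; ...; a_n U_n) as a dict world -> probability."""
    posterior = {w: Fraction(0) for w in prior}
    for U, alpha in zip(partition, alphas):
        mass = sum(prior[w] for w in U)
        if mass == 0:
            if alpha > 0:
                raise ValueError("undefined: alpha_i > 0 but P_W(U_i) = 0")
            continue  # alpha_i * P_W(. | U_i) is taken to be 0
        for w in U:
            posterior[w] += alpha * prior[w] / mass  # alpha_i * P_W(w | U_i)
    return posterior

F = Fraction
prior = {"a": F(1, 2), "b": F(1, 4), "c": F(1, 4)}
partition = [{"a", "b"}, {"c"}]
print(jeffrey_update(prior, partition, [F(1, 3), F(2, 3)]))
print(jeffrey_update(prior, partition, [F(1), F(0)]))  # ordinary conditioning on {a, b}
```

The second call illustrates the remark that ordinary conditioning is the special case in which one $\alpha_i$ equals 1.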

We now want to determine when updating in the naive space using Jeffrey conditioning
is appropriate. Thus, we assume that the agent's observations now have the form
$\alpha_1 U_1; \ldots; \alpha_n U_n$ for some partition $\{U_1, \ldots, U_n\}$ of $W$. (Different observations may,
in general, use different partitions.) Just as we did for the case that observations are
events (Section 3, first paragraph), we once again assume that the agent's observations
are accurate. What does that mean in the present context? We simply require that,
conditional on making the observation, the probability of $U_i$ really is $\alpha_i$ for $i = 1, \ldots, n$.
That is, for $i = 1, \ldots, n$, we have

$$\Pr(X_W \in U_i \mid X_O = \alpha_1 U_1; \ldots; \alpha_n U_n) = \alpha_i. \qquad (5)$$

This clearly generalizes the requirement of accuracy given in the case that the observations
are events.

Not surprisingly, there is a generalization of the CAR condition that is needed to
guarantee that Jeffrey conditioning can be applied to the naive space.

Theorem 5.1: Fix a probability $\Pr$ on $\mathcal{R}$, a partition $\{U_1, \ldots, U_n\}$ of $W$, and probabilities
$\alpha_1, \ldots, \alpha_n$ such that $\alpha_1 + \cdots + \alpha_n = 1$. Let $C$ be the observation $\alpha_1 U_1; \ldots; \alpha_n U_n$.
Fix some $i \in \{1, \ldots, n\}$. Then the following are equivalent:

(a) If $\Pr(X_O = C) > 0$, then $\Pr(X_W = w \mid X_O = C) = \Pr_W(w \mid \alpha_1 U_1; \ldots; \alpha_n U_n)$ for all
$w \in U_i$.

(b) $\Pr(X_O = C \mid X_W = w) = \Pr(X_O = C \mid X_W \in U_i)$ for all $w \in U_i$ such that
$\Pr(X_W = w) > 0$.

Part (b) of Theorem 5.1 is analogous to part (c) of Theorem 3.1. There are a number 
of conditions equivalent to (b) that we could have stated, similar in spirit to the conditions 
in Theorem 3.1. Note that these are even more stringent conditions than are required 
for ordinary conditioning to be appropriate. 

Examples 3.3 and 4.6 already suggest that there are not too many nontrivial scenarios
where applying Jeffrey conditioning to the naive space is appropriate. However, just as
for the original CAR condition, there do exist special situations in which generalized CAR
is a realistic assumption. For ordinary CAR, we mentioned the CARgen mechanism
(Section 4.5). For Jeffrey conditioning, a similar mechanism may be a realistic model in
some situations where all observations refer to the same partition $\{U_1, \ldots, U_n\}$ of $W$. We
now describe a scenario for such a situation. Suppose $O$ consists of $k > 1$ observations
$C_1, \ldots, C_k$ with $C_i = \alpha_{i1} U_1; \ldots; \alpha_{in} U_n$ such that all $\alpha_{ij} > 0$. Now, fix $n$ (arbitrary)
conditional distributions $\Pr_j$, $j = 1, \ldots, n$, on $W$. Intuitively, $\Pr_j$ is $\Pr_W(\cdot \mid U_j)$. Consider
the following mechanism: first an observation $C_i$ is chosen (according to some distribution
$P_O$ on $O$); then a set $U_j$ is chosen with probability $\alpha_{ij}$ (i.e., according to the distribution
induced by $C_i$); finally, a world $w \in U_j$ is chosen according to $\Pr_j$.

If the observation $C_i$ and world $w$ are generated this way, then the generalized CAR
condition holds, that is, conditioning in the sophisticated space coincides with Jeffrey
conditioning:

Proposition 5.2: Consider a partition $\{U_1, \ldots, U_n\}$ of $W$ and a set of $k > 1$ observations
$O$ as above. For every distribution $P_O$ on $O$ with $P_O(C_i) > 0$ for all $i \in \{1, \ldots, k\}$,
there exists a distribution $\Pr$ on $\mathcal{R}$ such that $P_O = \Pr_O$ (i.e., $P_O$ is the marginal of $\Pr$ on
$O$) and $\Pr$ satisfies the generalized CAR condition (Theorem 5.1(b)) for $U_1, \ldots, U_n$.
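
The three-stage mechanism behind Proposition 5.2 can be written out and checked numerically. This is a sketch under assumptions: each conditional distribution $\Pr_j$ is represented as a dict supported on $U_j$, observations are referred to by index, and all numbers are invented. The check corresponds to condition (b) of Theorem 5.1: within each $U_j$, $\Pr(X_O = C_i \mid X_W = w)$ must not depend on $w$.

```python
from fractions import Fraction

def jeffrey_mechanism_joint(p_obs, alphas, cond_dists):
    """joint[(i, w)] = P_O(C_i) * alpha_ij * Pr_j(w), where w lies in block U_j."""
    joint = {}
    for i, po in enumerate(p_obs):
        for j, dist in enumerate(cond_dists):
            for w, pw in dist.items():
                joint[(i, w)] = po * alphas[i][j] * pw
    return joint

def generalized_car_holds(joint, cond_dists, n_obs):
    """Check that Pr(X_O = C_i | X_W = w) is constant over each block U_j."""
    for i in range(n_obs):
        for dist in cond_dists:
            ratios = set()
            for w in dist:
                p_w = sum(joint[(k, w)] for k in range(n_obs))  # marginal Pr(X_W = w)
                if p_w > 0:
                    ratios.add(joint[(i, w)] / p_w)
            if len(ratios) > 1:
                return False
    return True

F = Fraction
cond_dists = [{"a": F(1, 4), "b": F(3, 4)}, {"c": F(1)}]   # Pr_1 on U_1, Pr_2 on U_2
alphas = [[F(1, 3), F(2, 3)], [F(1, 2), F(1, 2)]]           # alpha_ij for C_1, C_2
p_obs = [F(2, 5), F(3, 5)]
joint = jeffrey_mechanism_joint(p_obs, alphas, cond_dists)
print(generalized_car_holds(joint, cond_dists, len(p_obs)))
```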

Proposition 5.2 demonstrates that, even though the analogue of the CAR condition
expressed in Theorem 5.1 is hard to satisfy in general, at least if the set $\{U_1, \ldots, U_n\}$
is the same for all observations, then for every such set of observations there exist some
priors $\Pr$ on $\mathcal{R}$ for which the CAR-analogue is satisfied for all observations. As we show
next, for MRE updating, this is no longer the case.

5.2 Minimum Relative Entropy Updating



What about cases where the constraints are not in the special form where Jeffrey
conditioning can be applied? Perhaps the most common approach in this case is to use
MRE. Given a constraint (where a constraint is simply a set of probability distributions;
intuitively, the distributions satisfying the constraint) and a prior distribution $P_W$ on $W$,
the idea is to pick, among all distributions satisfying the constraint, the one that is
"closest" to the prior distribution, where the "closeness" of $P'_W$ to $P_W$ is measured using
relative entropy. The relative entropy between $P'_W$ and $P_W$ [Kullback and Leibler 1951;
Cover and Thomas 1991] is defined as

$$D(P'_W \,\|\, P_W) = \sum_{w \in W} P'_W(w) \log \frac{P'_W(w)}{P_W(w)}.$$

(The logarithm here is taken to the base 2; if $P'_W(w) = 0$ then $P'_W(w) \log(P'_W(w)/P_W(w))$
is taken to be 0. This is reasonable since $\lim_{x \to 0} x \log(x/c) = 0$ if $c > 0$.) The relative
entropy is finite provided that $P'_W$ is absolutely continuous with respect to $P_W$, in that
if $P_W(w) = 0$, then $P'_W(w) = 0$, for all $w \in W$. Otherwise, it is defined to be infinite.
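
The definition, including both conventions just stated, can be sketched in a few lines. Distributions are assumed to be dicts from worlds to probabilities; the function name is illustrative.

```python
import math

def relative_entropy(p_new, p_prior):
    """Relative entropy of p_new with respect to p_prior, in bits.

    Returns math.inf when p_new is not absolutely continuous w.r.t. p_prior.
    """
    total = 0.0
    for w, p in p_new.items():
        if p == 0:
            continue  # the term 0 * log(0 / c) is taken to be 0
        prior_mass = p_prior.get(w, 0)
        if prior_mass == 0:
            return math.inf  # positive mass where the prior has none
        total += p * math.log2(p / prior_mass)
    return total

uniform = {"a": 0.5, "b": 0.5}
point = {"a": 1.0, "b": 0.0}
print(relative_entropy(point, uniform))   # one bit
print(relative_entropy(uniform, point))   # infinite
```

The two calls also show that relative entropy is not symmetric in its arguments.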

The constraints we consider here are all closed and convex sets of probability measures.
In this case, it is known that there is a unique distribution that satisfies the
constraints and minimizes the relative entropy. Given a nonempty constraint $C$ and a
probability distribution $P_W$ on $W$, let $P_W(\cdot \mid C)$ denote the distribution that minimizes
relative entropy with respect to $P_W$.

If the constraints have the form to which Jeffrey's Rule is applicable, that is, if they
have the form $\{P'_W : P'_W(U_i) = \alpha_i,\ i = 1, \ldots, n\}$ for some partition $\{U_1, \ldots, U_n\}$, then
it is well known that the distribution that minimizes entropy relative to a prior $P_W$
is $P_W(\cdot \mid \alpha_1 U_1; \ldots; \alpha_n U_n)$ (see, e.g., [Diaconis and Zabell 1986]). Thus, MRE updating
generalizes Jeffrey conditioning (and hence also standard conditioning).
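
As a minimal sketch of the update just described, the following Python function rescales the prior inside each cell $U_i$ so that the cell receives mass $\alpha_i$, i.e. it computes $\alpha_i P_W(w \mid U_i)$ for $w \in U_i$. The function name and data layout are assumptions made for illustration.

```python
def jeffrey_update(prior, partition, alphas):
    """Jeffrey conditioning on alpha_1 U_1; ...; alpha_n U_n.

    prior: dict world -> probability; partition: list of disjoint sets
    covering the worlds; alphas: target masses summing to 1.
    """
    posterior = {}
    for cell, alpha in zip(partition, alphas):
        mass = sum(prior[w] for w in cell)
        if mass == 0 and alpha > 0:
            raise ValueError("constraint gives positive mass to a prior-null cell")
        for w in cell:
            posterior[w] = alpha * prior[w] / mass if mass > 0 else 0.0
    return posterior

prior = {"a": 0.5, "b": 0.25, "c": 0.25}
post = jeffrey_update(prior, [{"a"}, {"b", "c"}], [0.2, 0.8])
# Ordinary conditioning on {b, c} is the special case alphas = [0, 1].
cond = jeffrey_update(prior, [{"a"}, {"b", "c"}], [0.0, 1.0])
```

Within each cell the relative proportions of the prior are preserved; only the cell masses change.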

To study MRE updating in our framework, we assume that the observations are
now arbitrary closed convex constraints on the probability measure. Again, we assume
that the observations are accurate in that, conditional on making the observation, the
constraints hold. For now, we focus on the simplest possible case that cannot be handled
by Jeffrey updating. In this case, constraints (observations) still have the form
$\alpha_1 U_1; \ldots; \alpha_n U_n$, but now the $U_i$'s do not have to form a partition (they may overlap
and/or not cover $W$) and the $\alpha_i$ do not have to sum to 1. Such an observation is accurate
if it satisfies (5), just as before.

We can now ask the same questions that we asked before about ordinary conditioning 
and Jeffrey conditioning in the naive space. 

1. Is there an alternative characterization of the conditions under which MRE updating
coincides with conditioning in the sophisticated space? That is, are there
analogues of Theorem 3.1 and Theorem 5.1 for MRE updating?

2. Are there combinations of $\mathcal{O}$ and $W$ for which it is not even possible that MRE
can coincide with conditioning in the sophisticated space?

With regard to question 1, it is easy to provide a counterexample showing that there 
is no obvious analogue to Theorem 5.1 for MRE. There is a constraint C such that the 
condition of part (a) of Theorem 5.1 holds for MRE updating whereas part (b) does 
not hold. (We omit the details here.) Of course, it is possible that there are some quite 
different conditions that characterize when MRE updating coincides with conditioning in 
the sophisticated space. However, even if they exist, such conditions may be uninteresting 
in that they may hardly ever apply. Indeed, as a partial answer to question 2, we now 
introduce a very simple setting in which MRE updating necessarily leads to a result 
different from conditioning in the sophisticated space. 

Let $U_1$ and $U_2$ be two subsets of $W$ such that $V_1 = U_1 - U_2$, $V_2 = U_2 - U_1$, $V_3 = U_1 \cap U_2$,
and $V_4 = W - (U_1 \cup U_2)$ are all nonempty. Consider a constraint of the form $C =
\alpha_1 U_1; \alpha_2 U_2$, where $\alpha_1, \alpha_2$ are both in $(0, 1)$. We investigate what happens if we use MRE
updating on $C$. Since $U_1$ and $U_2$ overlap and do not cover the space, in general Jeffrey
conditioning cannot be applied to update on $C$. There are some situations where, despite
the overlap, Jeffrey conditioning can essentially be applied. We say that observation
$C = \alpha_1 U_1; \alpha_2 U_2$ is Jeffrey-like iff, after MRE updating on one of the constraints $\alpha_1 U_1$ or
$\alpha_2 U_2$, the other constraint holds as well. That is, $C$ is Jeffrey-like (with respect to $P_W$)
if either $P_W(U_2 \mid \alpha_1 U_1) = \alpha_2$ or $P_W(U_1 \mid \alpha_2 U_2) = \alpha_1$. Suppose that $P_W(U_2 \mid \alpha_1 U_1) = \alpha_2$;
then it is easy to show that $P_W(\cdot \mid \alpha_1 U_1) = P_W(\cdot \mid \alpha_1 U_1; \alpha_2 U_2)$.

Intuitively, if the "closest" distribution $P'_W$ to $P_W$ that satisfies $P'_W(U_1) = \alpha_1$ also
satisfies $P'_W(U_2) = \alpha_2$, then $P'_W$ is the closest distribution to $P_W$ that satisfies the constraint
$C = \alpha_1 U_1; \alpha_2 U_2$. Note that MRE updating on $\alpha U$ is equivalent to Jeffrey conditioning
on $\alpha U; (1 - \alpha)(W - U)$. Thus, if $C$ is Jeffrey-like, then updating with $C$ is equivalent to
Jeffrey updating.
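
The Jeffrey-like test can be sketched directly from this definition, using the fact noted above that MRE updating on a single constraint $\alpha U$ is Jeffrey conditioning on $\alpha U; (1-\alpha)(W-U)$. The helper names, the tolerance, and the uniform four-world prior are illustrative assumptions.

```python
def update_on_single_constraint(prior, U, alpha):
    """MRE update on 'alpha U', computed as Jeffrey conditioning on U and W - U.

    Assumes the prior gives U a mass strictly between 0 and 1.
    """
    mass_U = sum(p for w, p in prior.items() if w in U)
    return {
        w: (alpha * p / mass_U if w in U else (1 - alpha) * p / (1 - mass_U))
        for w, p in prior.items()
    }

def prob(dist, U):
    return sum(p for w, p in dist.items() if w in U)

def is_jeffrey_like(prior, U1, a1, U2, a2, tol=1e-9):
    """True iff updating on one constraint already satisfies the other."""
    post1 = update_on_single_constraint(prior, U1, a1)
    post2 = update_on_single_constraint(prior, U2, a2)
    return abs(prob(post1, U2) - a2) <= tol or abs(prob(post2, U1) - a1) <= tol

# Worlds 1..4 play the roles of V_1, V_2, V_3, V_4 (one world each).
prior = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}
U1, U2 = {1, 3}, {2, 3}
print(is_jeffrey_like(prior, U1, 0.6, U2, 0.5))  # True
print(is_jeffrey_like(prior, U1, 0.6, U2, 0.7))  # False
```

With this prior, updating on $0.6\,U_1$ leaves $U_2$ at probability 0.5, so only the pair $(0.6, 0.5)$ passes the test.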

Theorem 5.3: Given a set $\mathcal{R}$ of runs and a set $\mathcal{O} = \{C_1, C_2\}$ of observations, where
$C_i = \alpha_{i1} U_1; \alpha_{i2} U_2$, for $i = 1, 2$, let $\Pr$ be a distribution on $\mathcal{R}$ such that $\Pr(X_O = C_1)$,
$\Pr(X_O = C_2) > 0$, and $\Pr_W(w) = \Pr(X_W = w) > 0$ for all $w \in W$. Let $\Pr^i = \Pr(\cdot \mid X_O =
C_i)$, and let $\Pr^i_W$ be the marginal of $\Pr^i$ on $W$. If either $C_1$ or $C_2$ is not Jeffrey-like, then
we cannot have $\Pr^i_W = \Pr_W(\cdot \mid C_i)$, for both $i = 1, 2$.

For fixed $U_1$ and $U_2$, we can identify an observation $\alpha_1 U_1; \alpha_2 U_2$ with the pair $(\alpha_1, \alpha_2) \in
(0, 1)^2$. Under our conditions on $U_1$ and $U_2$, the set of all Jeffrey-like observations is
a subset of (Lebesgue) measure 0 of this set. Thus, the set of observations for which
MRE conditioning corresponds to conditioning in the sophisticated space is a (Lebesgue)
measure 0 set in the space of possible observations. Note, however, that this set depends
on the prior $P_W$ over $W$.

A result similar to Theorem 5.3 was proved by Seidenfeld [1986] (and considerably
generalized in [Dawid 2001]). Seidenfeld shows that, under very weak conditions, MRE
updating cannot coincide with sophisticated conditioning if the observations have the
form "the conditional probability of $U$ given $V$ is $\alpha$" (as is the case in the Judy Benjamin
problem). Theorem 5.3 shows that this is impossible even for observations of the much
simpler form $\alpha_1 U_1; \alpha_2 U_2$, unless we can reduce the problem to Jeffrey conditioning (in
which case Theorem 5.1 applies).

6 Discussion 

We have studied the circumstances under which ordinary conditioning, Jeffrey condition- 
ing, and MRE updating in a naive space can be justified, where "justified" for us means 
"agrees with conditioning in the sophisticated space". The main message of this paper 
is that, except for quite special cases, the three methods cannot be justified. Figure 1 
summarizes the main insights of this paper in more detail. 

As we mentioned in the introduction, the idea of comparing an update rule in a 
"naive space" with conditioning in a "sophisticated space" is not new; it appears in 
the CAR literature and the MRE literature (as well as in papers such as [Halpern and 
Tuttle 1993] and [Dawid and Dickey 1977]). In addition to bringing these two strands of 
research together, our own contributions are the following: (a) we show that the CAR 
framework can be used as a general tool to clarify many of the well-known paradoxes 
of conditional probability; (b) we give a general characterization of CAR in terms of a 
binary-valued matrix, showing that in many realistic scenarios, the CAR condition cannot
hold (Theorem 4.4); (c) we define a mechanism CARgen* that generates all and only 
distributions satisfying CAR (Theorem 4.9); (d) we show that the CAR condition has a 
natural extension to cases where Jeffrey conditioning can be applied (Theorem 5.1); and 
(e) we show that no CAR-like condition can hold in general for cases where only MRE 
(and not Jeffrey) updating can be applied (Theorem 5.3). 

Our results suggest that working in the naive space is rather problematic. On the 
other hand, as we observed in the introduction, working in the sophisticated space (even 
assuming it can be constructed) is problematic too. So what are the alternatives? 

For one thing, it is worth observing that MRE updating is not always so bad. In
many successful practical applications, the "constraint" on which to update is of the form
$\frac{1}{n} \sum_{i=1}^{n} X_i = t$ for some large $n$, where $X_i$ is the $i$th outcome of a random variable $X$ on
$W$. That is, we observe an empirical average of outcomes of $X$. In such a case, the MRE
distribution is "close" (in the appropriate distance measure) to the distribution we arrive
at by sophisticated conditioning. That is, if $\Pr'' = \Pr_W(\cdot \mid E(X) = t)$, $\Pr' = \Pr(\cdot \mid X_O =
\langle \frac{1}{n} \sum_{i=1}^{n} X_i = t \rangle)$, and $Q^n$ denotes the $n$-fold product of a probability distribution $Q$, then
for sufficiently large $n$, we have that $(\Pr'')^n \approx (\Pr'_W)^n$ [van Campenhout and Cover 1981;
Grünwald 2001; Skyrms 1985; Uffink 1996]. Thus, in such cases MRE (almost) coincides
with sophisticated conditioning after all. (See [Dawid 2001] for a discussion of how this
result can be reconciled with the results of Section 5.)

Figure 1: Conditions under which updating in the naive space coincides with conditioning
in the sophisticated space.

But when this special situation does not apply, it is worth asking whether there
exists an approach for updating in the naive space that can be easily applied in practical
situations, yet leads to updated distributions that are better, in some formally provable
sense, than those of the methods we have considered. A very interesting candidate, often informally
applied by human agents, is to simply ignore the available extra information. It turns 
out that in many situations this update rule behaves better, in a precise sense, than the 
three methods we have considered. This will be explored in future work. 

Our discussion here has focused completely on the probabilistic case. However, these 
questions also make sense for other representations of uncertainty. Interestingly, in 
[Friedman and Halpern 1999], it is shown that AGM-style belief revision [Alchourrón,
Gärdenfors, and Makinson 1985] can be represented in terms of conditioning using a
qualitative representation of uncertainty called a plausibility measure; to do this, the 
plausibility measure must satisfy the analogue of Theorem 3.1(a), so that observations 
carry no more information than the fact that they are true. No CAR-like condition is 
given to guarantee that this condition holds for plausibility measures though. It would 
be interesting to know if there are analogues to CAR for other representations of
uncertainty, such as possibility measures [Dubois and Prade 1990] or belief functions [Shafer
1976].

A Proofs

In this section, we provide the proofs of all the results in the paper. For convenience, we 
restate the results here. 

Theorem 3.1: Fix a probability $\Pr$ on $\mathcal{R}$ and a set $U \subseteq W$. The following are
equivalent:

(a) If $\Pr(X_O = U) > 0$, then $\Pr(X_W = w \mid X_O = U) = \Pr(X_W = w \mid X_W \in U)$ for all
$w \in U$.

(b) The event $X_W = w$ is independent of the event $X_O = U$ given $X_W \in U$, for all
$w \in U$.

(c) $\Pr(X_O = U \mid X_W = w) = \Pr(X_O = U \mid X_W \in U)$ for all $w \in U$ such that $\Pr(X_W =
w) > 0$.

(d) $\Pr(X_O = U \mid X_W = w) = \Pr(X_O = U \mid X_W = w')$ for all $w, w' \in U$ such that
$\Pr(X_W = w) > 0$ and $\Pr(X_W = w') > 0$.
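
Condition (d) is easy to test mechanically on a finite joint distribution over (world, observation) pairs. The sketch below does so for the Monty Hall puzzle, under the usual assumptions (not restated in this appendix) that the contestant picks door 1 and that the host opens door 2 or 3 uniformly at random when the car is behind door 1; the observation is the set of doors that may still hide the car. The representation and names are hypothetical.

```python
from collections import defaultdict

def satisfies_car(joint, tol=1e-12):
    """Check Theorem 3.1(d) for every observed set U.

    joint maps (w, U) to Pr(X_W = w, X_O = U), with U a frozenset containing w.
    """
    p_world = defaultdict(float)
    for (w, _), p in joint.items():
        p_world[w] += p
    for U in {U for (_, U) in joint}:
        # Pr(X_O = U | X_W = w) for every w in U with positive probability
        conds = [joint.get((w, U), 0.0) / p_world[w] for w in U if p_world[w] > 0]
        if conds and max(conds) - min(conds) > tol:
            return False
    return True

U12, U13 = frozenset({1, 2}), frozenset({1, 3})
monty_hall = {(1, U12): 1 / 6, (1, U13): 1 / 6, (2, U12): 1 / 3, (3, U13): 1 / 3}

A, B = frozenset({1}), frozenset({2, 3})
disjoint = {(1, A): 1 / 3, (2, B): 1 / 3, (3, B): 1 / 3}

print(satisfies_car(monty_hall))  # False
print(satisfies_car(disjoint))    # True
```

In the Monty Hall protocol the observation {1, 2} has probability 1/2 given world 1 but probability 1 given world 2, so (d) fails; with pairwise disjoint observations it holds trivially.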

Proof: Suppose (a) holds. We want to show that $X_W = w$ and $X_O = U$ are independent
given $X_W \in U$, for all $w \in U$. Fix $w \in U$. If $\Pr(X_O = U) = 0$ then the events are trivially
independent. So suppose that $\Pr(X_O = U) > 0$. Clearly

\[ \Pr(X_W = w \mid X_O = U \cap X_W \in U) = \Pr(X_W = w \mid X_O = U) \]

(since observing $U$ implies that the true world is in $U$). By part (a),

\[ \Pr(X_W = w \mid X_O = U) = \Pr(X_W = w \mid X_W \in U). \]

Thus,

\[ \Pr(X_W = w \mid X_O = U \cap X_W \in U) = \Pr(X_W = w \mid X_W \in U), \]

showing that $X_W = w$ is independent of $X_O = U$, given $X_W \in U$.

Next suppose that (b) holds, and $w \in U$ is such that $\Pr(X_W = w) > 0$. From part
(b) it is immediate that $\Pr(X_O = U \mid X_W = w \cap X_W \in U) = \Pr(X_O = U \mid X_W \in U)$.
Moreover, since $w \in U$, clearly $\Pr(X_O = U \mid X_W = w \cap X_W \in U) = \Pr(X_O = U \mid X_W =
w)$. Part (c) now follows.






Clearly (d) follows immediately from (c). Thus, it remains to show that (a) follows
from (d). We do this by showing that (d) implies (c) and that (c) implies (a). So
suppose that (d) holds. Suppose that $\Pr(X_O = U \mid X_W = w) = a$ for all $w \in U$ such that
$\Pr(X_W = w) > 0$. From the definition of conditional probability

\[
\begin{aligned}
\Pr(X_O = U \mid X_W \in U)
&= \sum_{\{w \in U :\, \Pr(X_W = w) > 0\}} \Pr(X_O = U \cap X_W = w) / \Pr(X_W \in U) \\
&= \sum_{\{w \in U :\, \Pr(X_W = w) > 0\}} \Pr(X_O = U \mid X_W = w) \Pr(X_W = w) / \Pr(X_W \in U) \\
&= \sum_{\{w \in U :\, \Pr(X_W = w) > 0\}} a \Pr(X_W = w) / \Pr(X_W \in U) \\
&= a.
\end{aligned}
\]

Thus, (c) follows from (d).

Finally, to see that (a) follows from (c), suppose that (c) holds. If $w \in U$ is such that
$\Pr(X_W = w) = 0$, then (a) is immediate, so suppose that $\Pr(X_W = w) > 0$. Then, using
(c) and the fact that $X_O = U \subseteq X_W \in U$, we have that

\[
\begin{aligned}
\Pr(X_W = w \mid X_O = U)
&= \Pr(X_O = U \mid X_W = w) \Pr(X_W = w) / \Pr(X_O = U) \\
&= \Pr(X_O = U \mid X_W \in U) \Pr(X_W = w) / \Pr(X_O = U) \\
&= \Pr(X_O = U \cap X_W \in U) \Pr(X_W = w) / \bigl(\Pr(X_W \in U) \Pr(X_O = U)\bigr) \\
&= \Pr(X_O = U) \Pr(X_W = w) / \bigl(\Pr(X_W \in U) \Pr(X_O = U)\bigr) \\
&= \Pr(X_W = w) / \Pr(X_W \in U) \\
&= \Pr(X_W = w \mid X_W \in U),
\end{aligned}
\]

as desired. ∎

Proposition 4.1: The CAR condition holds for all distributions $\Pr$ on $\mathcal{R}$ if and only
if $\mathcal{O}$ consists of pairwise disjoint subsets of $W$.

Proof: First suppose that the sets in $\mathcal{O}$ are pairwise disjoint. Then for each probability
distribution $\Pr$ on $\mathcal{R}$, each $U \in \mathcal{O}$, and each world $w \in U$ such that $\Pr(X_W = w) > 0$,
it must be the case that $\Pr(X_O = U \mid X_W = w) = 1$. Thus, part (d) of Theorem 3.1
applies.

For the converse, suppose that the sets in $\mathcal{O}$ are not pairwise disjoint. Then there
exist sets $U, U' \in \mathcal{O}$ such that both $U - U'$ and $U \cap U'$ are nonempty. Let $w \in U \cap U'$.
Clearly there exists a distribution $\Pr$ on $\mathcal{R}$ such that $\Pr(X_O = U) > 0$, $\Pr(X_O = U') > 0$,
$\Pr(X_W = w \mid X_O = U) = 0$, $\Pr(X_W = w \mid X_O = U') > 0$. But then $\Pr(X_W = w \mid X_W \in
U) > 0$. Thus

\[ \Pr(X_W = w \mid X_O = U) \neq \Pr(X_W = w \mid X_W \in U), \]

and the CAR condition (part (a) of Theorem 3.1) is violated. ∎

Lemma 4.3: Let $\mathcal{R}$ be the set of runs over observations $\mathcal{O}$ and worlds $W$, and let $S$ be
the CARacterizing matrix for $\mathcal{O}$ and $W$.

(a) Let $\Pr$ be any distribution over $\mathcal{R}$ and let $S'$ be the matrix obtained by deleting from
$S$ all rows corresponding to an atom $A$ with $\Pr(X_W \in A) = 0$. Define the vector
$\gamma = (\gamma_1, \ldots, \gamma_n)$ by setting $\gamma_j = \Pr(X_O = U_j \mid X_W \in U_j)$ if $\Pr(X_W \in U_j) > 0$, and
$\gamma_j = 0$ otherwise, for $j = 1, \ldots, n$. If $\Pr$ satisfies CAR, then $S' \cdot \gamma^T = \vec{1}^T$.

(b) Let $S'$ be a matrix consisting of a subset of the rows of $S$, and let $\mathcal{P}_{W,S'}$ be the set
of distributions over $W$ with support corresponding to $S'$; i.e.,

\[ \mathcal{P}_{W,S'} = \{P_W \mid P_W(A) > 0 \text{ iff } A \text{ corresponds to a row in } S'\}. \]

If there exists a vector $\gamma \ge \vec{0}$ such that $S' \cdot \gamma^T = \vec{1}^T$, then, for all $P_W \in \mathcal{P}_{W,S'}$, there
exists a distribution $\Pr$ over $\mathcal{R}$ with $\Pr_W = P_W$ (i.e., the marginal of $\Pr$ on $W$ is
$P_W$) such that (a) $\Pr$ satisfies CAR and (b) $\Pr(X_O = U_j \mid X_W \in U_j) = \gamma_j$ for all $j$ with
$\Pr(X_W \in U_j) > 0$.
Proof: For part (a), suppose that $\Pr$ is a distribution on $\mathcal{R}$ that satisfies CAR. Let $k$
be the number of rows in $S'$, and let $\alpha_i = \Pr(X_W \in A_i)$, for $i = 1, \ldots, k$, where $A_i$ is the
atom corresponding to the $i$th row of $S'$. Note that $\alpha_i > 0$ for $i = 1, \ldots, k$. Clearly,

\[ \sum_{\{j :\, A_i \subseteq U_j\}} \Pr(X_O = U_j \mid X_W \in A_i) = 1. \tag{6} \]

It easily follows from the CAR condition that

\[ \Pr(X_O = U_j \mid X_W \in A_i) = \Pr(X_O = U_j \mid X_W \in U_j) \]

for all $A_i \subseteq U_j$, so (6) is equivalent to

\[ \sum_{\{j :\, A_i \subseteq U_j\}} \Pr(X_O = U_j \mid X_W \in U_j) = 1. \tag{7} \]

(7) implies that $\sum_{\{j :\, A_i \subseteq U_j\}} \gamma_j = 1$ for $i = 1, \ldots, k$. Let $s_i$ be the row in $S'$ corresponding
to $A_i$. Since $s_i$ has a 1 as its $j$th component if $A_i \subseteq U_j$ and a 0 otherwise, it follows that
$s_i \cdot \gamma^T = 1$ and hence $S' \cdot \gamma^T = \vec{1}^T$.

For part (b), let $k$ be the number of rows in $S'$, let $s_1, \ldots, s_k$ be the rows of $S'$, and
let $A_1, \ldots, A_k$ be the corresponding atoms. Fix $P_W \in \mathcal{P}_{W,S'}$, and set $\alpha_i = P_W(A_i)$ for
$i = 1, \ldots, k$. Let $\Pr$ be the unique distribution on $\mathcal{R}$ such that

\[
\begin{aligned}
&\Pr(X_W \in A_i) = \alpha_i, \text{ for } i = 1, \ldots, k, \\
&\Pr(X_W \in A) = 0 \text{ if } A \in \mathcal{A} - \{A_1, \ldots, A_k\}, \\
&\Pr(X_O = U_j \mid X_W \in A_i) =
  \begin{cases} \gamma_j & \text{if } A_i \subseteq U_j, \\ 0 & \text{otherwise.} \end{cases}
\end{aligned}
\tag{8}
\]

Note that $\Pr$ is indeed a probability distribution on $\mathcal{R}$, since $\sum_{A \in \mathcal{A}} \Pr(X_W \in A) = 1$,
$\Pr(X_W \in A_i) > 0$ for $i = 1, \ldots, k$, and, since we are assuming that $S' \cdot \gamma^T = \vec{1}^T$,

\[ \sum_{j=1}^{n} \Pr(X_O = U_j \mid X_W \in A_i) = s_i \cdot \gamma^T = 1, \]

for $i = 1, \ldots, k$. Clearly $\Pr_W = P_W$. It remains to show that $\Pr$ satisfies CAR and that
$\gamma_j = \Pr(X_O = U_j \mid X_W \in U_j)$. Given $j \in \{1, \ldots, n\}$, suppose that there exist atoms
$A_i, A_{i'}$ corresponding to rows $s_i$ and $s_{i'}$ of $S'$ such that $A_i, A_{i'} \subseteq U_j$. Then

\[ \Pr(X_O = U_j \mid X_W \in A_i) = \Pr(X_O = U_j \mid X_W \in A_{i'}) = \gamma_j. \]

It now follows by Theorem 3.1(c) that $\Pr$ satisfies the CAR condition for $U_1, \ldots, U_n$.
Moreover, by Theorem 3.1(d), it must be the case that $\Pr(X_O = U_j \mid X_W \in U_j) = \gamma_j$. ∎
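
To make the role of the matrix concrete, the following sketch builds the 0/1 rows described in the proof (entry $j$ of a row is 1 iff the atom is contained in $U_j$) and tests whether the linear system $S \cdot \gamma^T = \vec{1}^T$ of Lemma 4.3(a) has any solution at all. It is an illustration under stated assumptions: atoms are obtained by grouping worlds with the same membership pattern, the function names are hypothetical, and nonnegativity of $\gamma$ is not checked, so a `True` answer is only a necessary condition.

```python
from fractions import Fraction

def caracterizing_matrix(worlds, observations):
    """One 0/1 row per atom; entry j is 1 iff the atom is contained in U_j."""
    signatures = {tuple(int(w in U) for U in observations) for w in worlds}
    return sorted(signatures, reverse=True)

def has_solution(S):
    """Gaussian elimination over the rationals: is S * gamma = 1 solvable?"""
    rows = [[Fraction(x) for x in row] + [Fraction(1)] for row in S]
    n = len(S[0])
    pivot_row = 0
    for col in range(n):
        pivot = next((r for r in range(pivot_row, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[pivot_row], rows[pivot] = rows[pivot], rows[pivot_row]
        for r in range(len(rows)):
            if r != pivot_row and rows[r][col] != 0:
                factor = rows[r][col] / rows[pivot_row][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
    # The system is inconsistent iff some reduced row reads 0 = nonzero.
    return all(any(x != 0 for x in row[:-1]) or row[-1] == 0 for row in rows)

monty = caracterizing_matrix([1, 2, 3], [{1, 2}, {1, 3}])
print(monty)                # [(1, 1), (1, 0), (0, 1)]
print(has_solution(monty))  # False

disjoint = caracterizing_matrix([1, 2, 3], [{1}, {2, 3}])
print(has_solution(disjoint))  # True
```

For the Monty Hall observations the system forces $\gamma_1 = 1$, $\gamma_2 = 1$ and $\gamma_1 + \gamma_2 = 1$ at once, so by Lemma 4.3(a) no distribution that gives all three atoms positive probability can satisfy CAR.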

The proof of Theorem 4.4 builds on Lemma 4.3 and the following proposition, which 
shows that the condition of part (b) of Theorem 4.4 is actually stronger than the condition 
of part (a). It is therefore not surprising that it leads to a stronger conclusion. 

Proposition A.1: If there exists a subset $R$ of rows of $S$ that is linearly dependent
but not affinely dependent, then for all $\mathcal{R}$-atoms $A$ corresponding to a row in $R$ and all
$j^* \in \{1, \ldots, n\}$, if $A \subseteq U_{j^*}$, there exists a vector $u$ that is an affine combination of the
rows in $R$ such that $u_j \ge 0$ for all $j \in \{1, \ldots, n\}$ and $u_{j^*} > 0$.

Proof: Suppose that there exists a subset $R$ of rows of $S$ that is linearly dependent but
not affinely dependent. Without loss of generality, let $v_1, \ldots, v_k$ be the rows in $R$. There
exist $\lambda_1, \ldots, \lambda_k$ such that $\kappa = \sum_{i=1}^{k} \lambda_i \neq 0$ and $\sum_{i=1}^{k} \lambda_i v_i = 0$. We first show that in fact
every row $v$ in $R$ is an affine combination of the other rows. Fix some $j \in \{1, \ldots, k\}$.
Let $\mu_j = \lambda_j - \sum_{i=1}^{k} \lambda_i = -\sum_{i \neq j} \lambda_i$ and let $\mu_i = \lambda_i$ for $i \neq j$. Then $\sum_{i=1}^{k} \mu_i = 0$ and

\[ \sum_{i=1}^{k} \mu_i v_i = \sum_{i=1}^{k} \lambda_i v_i - \sum_{i=1}^{k} \lambda_i v_j = -\kappa v_j. \]

For $i = 1, \ldots, k$, let $\mu'_i = -\mu_i/\kappa$. Then $\sum_{i=1}^{k} \mu'_i = 0$ and $\sum_{i=1}^{k} \mu'_i v_i = v_j$. Now if $A_i \subseteq U_{j^*}$
for some $i = 1, \ldots, k$ and some $j^* = 1, \ldots, n$, then $v_i$ has a 1 as its $j^*$th component.
Also, $v_i$ is an affine combination of the rows of $R$ with no negative components, so $v_i$ is
the desired vector. ∎

Theorem 4.4: Let $\mathcal{R}$ be a set of runs over observations $\mathcal{O} = \{U_1, \ldots, U_n\}$ and worlds
$W$, and let $S$ be the CARacterizing matrix for $\mathcal{O}$ and $W$.

(a) Suppose that there exists a subset $R$ of the rows in $S$ and a vector $u = (u_1, \ldots, u_n)$
that is an affine combination of the rows of $R$ such that $u_j \ge 0$ for all $j \in \{1, \ldots, n\}$
and $u_{j^*} > 0$ for some $j^* \in \{1, \ldots, n\}$. Then there is no distribution $\Pr$ on $\mathcal{R}$ that
satisfies CAR such that $\Pr(X_O = U_{j^*}) > 0$ and $\Pr(X_W \in A) > 0$ for each $\mathcal{R}$-atom
$A$ corresponding to a row in $R$.

(b) If there exists a subset $R$ of the rows of $S$ that is linearly dependent but not affinely
dependent, then there is no distribution $\Pr$ on $\mathcal{R}$ that satisfies CAR such that
$\Pr(X_W \in A) > 0$ for each $\mathcal{R}$-atom $A$ corresponding to a row in $R$.

(c) Given a set $R$ consisting of $n$ linearly independent rows of $S$ and a distribution
$P_W$ on $W$ such that $P_W(A) > 0$ for all $A$ corresponding to a row in $R$, there is a
unique distribution $P_O$ on $\mathcal{O}$ such that if $\Pr$ is a distribution on $\mathcal{R}$ satisfying CAR
and $\Pr(X_W \in A) = P_W(A)$ for each atom $A$ corresponding to a row in $R$, then
$\Pr(X_O = U) = P_O(U)$.
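
As a worked instance of part (b), consider the Monty Hall observations of Example 1.1, under the assumption (not restated in this appendix) that the contestant picks door 1, so that the possible observations are $U_1 = \{1, 2\}$ (door 3 is opened) and $U_2 = \{1, 3\}$ (door 2 is opened). Reading "affine combination" as in the proof of part (a), that is, as a linear combination whose coefficients sum to 0, the computation can be sketched as follows.

```latex
% Atoms: A_1 = \{1\} = U_1 \cap U_2, \quad A_2 = \{2\} = U_1 - U_2, \quad A_3 = \{3\} = U_2 - U_1.
\[
S = \begin{pmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix},
\qquad
v_1 - v_2 - v_3 = (0, 0),
\qquad
1 - 1 - 1 = -1 \neq 0 .
\]
% Any \lambda with \lambda_1 v_1 + \lambda_2 v_2 + \lambda_3 v_3 = 0 satisfies
% \lambda_1 + \lambda_2 = 0 and \lambda_1 + \lambda_3 = 0, so \lambda = t(1, -1, -1),
% whose coefficients sum to -t; this is 0 only for t = 0.
```

The three rows are therefore linearly dependent but not affinely dependent, and part (b) rules out any distribution satisfying CAR that gives positive probability to all three doors.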

Proof: For part (a), suppose that $R$ consists of $v_1, \ldots, v_k$, corresponding to atoms
$A_1, \ldots, A_k$. By assumption, there exist coefficients $\lambda_1, \ldots, \lambda_k$ such that $\sum_{i=1}^{k} \lambda_i = 0$, and
a vector $u = \sum_{i=1}^{k} \lambda_i v_i$ such that every component of $u$ is nonnegative. Suppose, by way of
contradiction, that $\Pr$ satisfies CAR and that $\alpha_i = \Pr(X_W \in A_i) > 0$ for $i \in \{1, \ldots, k\}$.
By Lemma 4.3(a), we have

\[
\Bigl( \sum_{i=1}^{k} \lambda_i v_i \Bigr) \cdot \gamma
= \sum_{i=1}^{k} \lambda_i (v_i \cdot \gamma)
= \sum_{i=1}^{k} \lambda_i = 0, \tag{9}
\]

where $\gamma$ is defined as in Lemma 4.3. For $j = 1, \ldots, n$, if $\Pr(X_O = U_j) > 0$ then
$\Pr(X_O = U_j \cap X_W \in U_j) = \Pr(X_O = U_j) > 0$ and $\Pr(X_W \in U_j) > 0$, so $\gamma_j > 0$. By
assumption, all the components of $u$ and $\gamma$ are nonnegative. Therefore, if there exists $j^*$
such that $\Pr(X_O = U_{j^*}) > 0$ and $u_{j^*} > 0$, then $u \cdot \gamma > 0$. This contradicts (9), and part
(a) is proved.

For part (b), suppose that there exists a subset $R$ of rows of $S$ that is linearly
dependent but not affinely dependent. Suppose, by way of contradiction, that $\Pr$ satisfies
CAR and that $\Pr(X_W \in A) > 0$ for all atoms $A$ corresponding to a row in $R$. Pick an
atom $A^*$ corresponding to such a row. By Proposition A.1 and Theorem 4.4(a), we have
that $\Pr(X_O = U_{j^*}) = 0$ for all $j^*$ such that $A^* \subseteq U_{j^*}$. But then $\Pr(X_W \in A^*) = 0$, and
we have arrived at a contradiction.

For part (c), suppose that $R$ consists of the rows $v_1, \ldots, v_n$. Let $S'$ be the $n \times n$
submatrix of $S$ consisting of the rows of $R$. Since these rows are linearly independent, a
standard result of linear algebra says that $S'$ is invertible. Let $\Pr$ be a distribution on
$\mathcal{R}$ satisfying CAR. By Lemma 4.3(a), $S' \gamma^T = \vec{1}^T$. Thus, $\gamma^T = (S')^{-1} \vec{1}^T$. For $j = 1, \ldots, n$
we must have $\gamma_j = \beta_j / \Pr(X_W \in U_j)$, where $\beta_j = \Pr(X_O = U_j)$. Given $P_W(A)$ for each
atom $A$, we can clearly solve for the $\beta_j$'s. ∎

Theorem 4.9: Given a set $\mathcal{R}$ of runs over a set $W$ of worlds and a set $\mathcal{O}$ of observations,
$\Pr$ is a distribution on $\mathcal{R}$ that satisfies CAR iff there is a setting of the parameters in
CARgen* such that, for all $w \in W$ and $U \in \mathcal{O}$, $\Pr(\{r : X_W(r) = w, X_O(r) = U\})$ is
the probability that CARgen* returns $(w, U)$.

Proof: First we show that if $\Pr$ is a probability on $\mathcal{R}$ such that, for some setting of
the parameters of CARgen*, $\Pr(\{r : X_W(r) = w, X_O(r) = U\})$ is the probability
that CARgen* returns $(w, U)$, then $\Pr$ satisfies CAR. By Theorem 3.1, it suffices to
show that, for each set $U \in \mathcal{O}$ and worlds $w_1, w_2 \in U$ such that $\Pr(X_W = w_1) > 0$
and $\Pr(X_W = w_2) > 0$, we have $\Pr(X_O = U \mid X_W = w_1) = \Pr(X_O = U \mid X_W =
w_2)$. So suppose that $w_1, w_2 \in U$, $\Pr(X_W = w_1) > 0$, and $\Pr(X_W = w_2) > 0$. Let
$\alpha_U = \sum_{\{\Pi \in \mathcal{P} :\, U \in \Pi\}} P_{\mathcal{P}}(\Pi)(1 - q_{U|\Pi})$. Intuitively, $\alpha_U$ is the probability that the algorithm
terminates immediately at step 2.3 with $(w, U)$ conditional on some $w \in U$ being chosen
at step 2.1. Notice for future reference that, for all $w$,

\[
\sum_{\{U :\, w \in U\}} \alpha_U
= \sum_{\{(U, \Pi) :\, U \in \Pi,\ w \in U\}} P_{\mathcal{P}}(\Pi)(1 - q_{U|\Pi}) = 1 - q, \tag{10}
\]

where $q$ is defined by (4). As explained in the main text, for both $i = 1, 2$, $q$ is the
probability that the algorithm does not terminate at step 2.3 given that $w_i$ is chosen in
step 2.1. It easily follows that the probability that $(w_i, U)$ is output at step 2.3 is

\[ P_W(w_i)\, \alpha_U (1 + q + q^2 + \cdots) = P_W(w_i)\, \alpha_U / (1 - q). \]

Thus, $\Pr(X_W = w_i \cap X_O = U) = P_W(w_i)\, \alpha_U / (1 - q)$. Using (10), we have that

\[
\Pr(X_W = w_i) = \sum_{\{U :\, w_i \in U\}} \Pr(X_W = w_i \cap X_O = U)
= \frac{P_W(w_i)}{1 - q} \sum_{\{U :\, w_i \in U\}} \alpha_U = P_W(w_i).
\]

Finally, we have that $\Pr(X_O = U \mid X_W = w_i) = \alpha_U / (1 - q)$, for $i = 1, 2$. Thus, $\Pr$ satisfies
the CAR condition.

For the converse, suppose that $\Pr$ satisfies the CAR condition. Let $\mathcal{O} = \{U_1, \ldots, U_n\}$.
We choose the parameters for CARgen* as follows. Set $P_W(w) = \Pr(X_W = w)$ and let
$\beta_i = \Pr(X_O = U_i)$. Without loss of generality, we assume that $\beta_i > 0$ (otherwise, take
$\mathcal{O}'$ to consist of those sets that are observed with positive probability, and do the proof
using $\mathcal{O}'$).

For $i = 1, \ldots, n$, let $\Pi_i = \{U_i, \overline{U}_i\}$. Set $P_{\mathcal{P}}(\Pi_i) = \Pr(X_O = U_i) = \beta_i$ and $q_{\overline{U}_i|\Pi_i} = 1$.
(Thus, the set $\overline{U}_i$ is always rejected, unless $\overline{U}_i = U_j$.) Since $\Pr(X_W \in U_j) \ge \Pr(X_O =
U_j) > 0$ by assumption, it must be the case that $\epsilon = \min_{j=1}^{n} \Pr(X_W \in U_j) > 0$. Now set
$q_{U_i|\Pi_i} = 1 - \epsilon / \Pr(X_W \in U_i)$.

We first show that, with these parameter settings, we can choose $q$ such that constraint
(4) is satisfied. Let $q_w = \sum_{\{(U, \Pi) :\, w \in U,\ U \in \Pi\}} P_{\mathcal{P}}(\Pi)\, q_{U|\Pi}$. For each $w \in W$ such that $P_W(w) >
0$, we have

\[
\begin{aligned}
q_w
&= \sum_{\{(U, \Pi) :\, w \in U,\ U \in \Pi\}} P_{\mathcal{P}}(\Pi)\, q_{U|\Pi} \\
&= \sum_{i=1}^{n} \sum_{\{U :\, w \in U,\ U \in \Pi_i\}} P_{\mathcal{P}}(\Pi_i)\, q_{U|\Pi_i} \\
&= \sum_{\{i :\, w \in U_i\}} P_{\mathcal{P}}(\Pi_i)\, q_{U_i|\Pi_i} + \sum_{\{i :\, w \in \overline{U}_i\}} P_{\mathcal{P}}(\Pi_i)\, q_{\overline{U}_i|\Pi_i}.
\end{aligned}
\]

The last equality follows because $\Pi_i = \{U_i, \overline{U}_i\}$. Thus, for a fixed $i$, $\sum_{\{U :\, w \in U,\ U \in \Pi_i\}} P_{\mathcal{P}}(\Pi_i)\, q_{U|\Pi_i}$
is either $P_{\mathcal{P}}(\Pi_i)\, q_{U_i|\Pi_i}$ if $w \in U_i$, or $P_{\mathcal{P}}(\Pi_i)\, q_{\overline{U}_i|\Pi_i}$ if $w \in \overline{U}_i$. It follows that

\[
\begin{aligned}
q_w
&= \sum_{\{i :\, w \in U_i\}} \Pr(X_O = U_i)\bigl(1 - \epsilon / \Pr(X_W \in U_i)\bigr) + \sum_{\{i :\, w \notin U_i\}} \Pr(X_O = U_i) \\
&= \sum_{i=1}^{n} \Pr(X_O = U_i) - \epsilon \sum_{\{i :\, w \in U_i\}} \Pr(X_O = U_i \mid X_W \in U_i) \\
&= 1 - \epsilon \sum_{\{i :\, w \in U_i\}} \Pr(X_O = U_i \mid X_W = w) \quad \text{[since $\Pr$ satisfies CAR]} \\
&= 1 - \epsilon.
\end{aligned}
\]

Thus, $q_w = q_{w'}$ if $P_W(w), P_W(w') > 0$, so these parameter settings are appropriate for
CARgen* (taking $q = q_w$ for any $w$ such that $P_W(w) > 0$). Moreover, $\epsilon = 1 - q$.

We now show that, with these parameter settings, $\Pr(X_W = w \cap X_O = U)$ is the
probability that CARgen* halts with $(w, U)$, for all $w \in W$ and $U \in \mathcal{O}$. Clearly if
$\Pr(X_W = w) = 0$, this is true, since then $\Pr(X_W = w \cap X_O = U) = 0$, and the probability
that CARgen* halts with output $(w, U)$ is at most $P_W(w) = \Pr(X_W = w) = 0$. So
suppose that $\Pr(X_W = w) > 0$. Then it suffices to show that $\Pr(X_O = U_i \mid X_W = w)$ is
the probability that $(w, U_i)$ is output, given that $w$ is chosen at the first step. But the
argument of the first half of the proof shows that this probability is just $\alpha_{U_i} / (1 - q)$. But

\[
\begin{aligned}
\frac{\alpha_{U_i}}{1 - q}
&= \frac{\alpha_{U_i}}{\epsilon} \quad \text{[since $\epsilon = 1 - q$]} \\
&= \frac{1}{\epsilon} \sum_{\{\Pi \in \mathcal{P} :\, U_i \in \Pi\}} P_{\mathcal{P}}(\Pi)(1 - q_{U_i|\Pi}) \\
&= \frac{1}{\epsilon}\, \beta_i \bigl(\epsilon / \Pr(X_W \in U_i)\bigr) \\
&= \Pr(X_O = U_i) / \Pr(X_W \in U_i) \\
&= \Pr(X_O = U_i \mid X_W = w) \quad \text{[since $\Pr$ satisfies CAR]},
\end{aligned}
\]

as desired. ∎
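
The first half of the proof gives a closed form for the output distribution, $\Pr(X_W = w \cap X_O = U) = P_W(w)\,\alpha_U/(1-q)$. The sketch below evaluates that formula exactly with rationals and checks constraint (4) through the equivalent identity (10). The function name, the encoding of partitions as frozensets, and the two-world example are assumptions made for illustration; the procedure itself is defined in the main text, not here.

```python
from fractions import Fraction as F

def cargen_star_joint(p_w, p_part, q_rej):
    """Exact output distribution, following the formulas in the proof.

    p_w:    dict world -> P_W(w)
    p_part: dict partition (frozenset of frozensets) -> probability
    q_rej:  dict (U, partition) -> rejection probability q_{U|partition}
    """
    # alpha_U = sum over partitions containing U of P(partition) * (1 - q_{U|partition})
    alpha = {}
    for part, p in p_part.items():
        for U in part:
            alpha[U] = alpha.get(U, 0) + p * (1 - q_rej[(U, part)])
    # By (10), the alpha_U with w in U must sum to the same value 1 - q
    # for every world w of positive probability.
    accept = {w: sum(a for U, a in alpha.items() if w in U)
              for w, p in p_w.items() if p > 0}
    one_minus_q = next(iter(accept.values()))
    if one_minus_q == 0 or any(v != one_minus_q for v in accept.values()):
        raise ValueError("parameters violate constraint (4)")
    return {(w, U): p_w[w] * a / one_minus_q
            for U, a in alpha.items() for w in U if p_w.get(w, 0) > 0}

U12, U1, U2 = frozenset({1, 2}), frozenset({1}), frozenset({2})
coarse, fine = frozenset({U12}), frozenset({U1, U2})
p_w = {1: F(1, 3), 2: F(2, 3)}
p_part = {coarse: F(1, 2), fine: F(1, 2)}
q_rej = {(U12, coarse): F(1, 2), (U1, fine): F(1, 2), (U2, fine): F(1, 2)}
joint = cargen_star_joint(p_w, p_part, q_rej)
```

In this example $q = 1/2$ and each $\alpha_U = 1/4$, so the observation $\{1, 2\}$ has conditional probability 1/2 given either world, as the CAR condition requires.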

Theorem 5.1: Fix a probability $\Pr$ on $\mathcal{R}$, a partition $\{U_1, \ldots, U_n\}$ of $W$, and
probabilities $\alpha_1, \ldots, \alpha_n$ such that $\alpha_1 + \cdots + \alpha_n = 1$. Let $C$ be the observation $\alpha_1 U_1; \ldots; \alpha_n U_n$.
Fix some $i \in \{1, \ldots, n\}$. Then the following are equivalent:

(a) If $\Pr(X_O = C) > 0$, then $\Pr(X_W = w \mid X_O = C) = \Pr_W(w \mid \alpha_1 U_1; \ldots; \alpha_n U_n)$ for all
$w \in U_i$.

(b) $\Pr(X_O = C \mid X_W = w) = \Pr(X_O = C \mid X_W \in U_i)$ for all $w \in U_i$ such that $\Pr(X_W =
w) > 0$.

Proof: The proof is similar in spirit to that of Theorem 3.1. Suppose that (a) holds,
$w \in U_i$, and $\Pr(X_W = w) > 0$. Then

\[
\begin{aligned}
\Pr(X_O = C \mid X_W = w)
&= \Pr(X_W = w \mid X_O = C) \Pr(X_O = C) / \Pr(X_W = w) \\
&= \Pr_W(w \mid \alpha_1 U_1; \ldots; \alpha_n U_n) \Pr(X_O = C) / \Pr(X_W = w) \\
&= \alpha_i \Pr_W(w \mid U_i) \Pr(X_O = C) / \Pr_W(w) \\
&= \alpha_i \Pr(X_O = C) / \Pr_W(U_i).
\end{aligned}
\]

Similarly,

\[
\begin{aligned}
\Pr(X_O = C \mid X_W \in U_i)
&= \Pr(X_W \in U_i \mid X_O = C) \Pr(X_O = C) / \Pr(X_W \in U_i) \\
&= \sum_{w' \in U_i} \Pr_W(w' \mid \alpha_1 U_1; \ldots; \alpha_n U_n) \Pr(X_O = C) / \Pr(X_W \in U_i) \\
&= \sum_{w' \in U_i} \alpha_i \Pr_W(w' \mid U_i) \Pr(X_O = C) / \Pr(X_W \in U_i) \\
&= \alpha_i \Pr(X_O = C) / \Pr_W(U_i).
\end{aligned}
\]

Thus, $\Pr(X_O = C \mid X_W = w) = \Pr(X_O = C \mid X_W \in U_i)$ for all $w \in U_i$ such that
$\Pr(X_W = w) > 0$.

For the converse, suppose that (b) holds and $\Pr(X_O = C) > 0$. Given $w \in U_i$, if
$\Pr(X_W = w) = 0$, then (a) trivially holds, so suppose that $\Pr(X_W = w) > 0$.
Clearly $\Pr_W(w \mid \alpha_1 U_1; \ldots; \alpha_n U_n) = \alpha_i \Pr_W(w \mid U_i)$. Now, using (b), we have
that

\[
\begin{aligned}
\Pr(X_W = w \mid X_O = C)
&= \Pr(X_O = C \mid X_W = w) \Pr(X_W = w) / \Pr(X_O = C) \\
&= \Pr(X_O = C \mid X_W \in U_i) \Pr(X_W = w) / \Pr(X_O = C) \\
&= \Pr(X_W \in U_i \mid X_O = C) \Pr(X_W = w) / \Pr(X_W \in U_i) \\
&= \alpha_i \Pr_W(w \mid U_i) \quad \text{[using (5)]}.
\end{aligned}
\]

Thus, (a) holds. ∎

Proposition 5.2: Consider a partition $\{U_1, \ldots, U_n\}$ of $W$ and a set of $k > 1$
observations $\mathcal{O} = \{C_1, \ldots, C_k\}$ with $C_i = \alpha_{i1} U_1; \ldots; \alpha_{in} U_n$ such that all $\alpha_{ij} > 0$. For every
distribution $P_O$ on $\mathcal{O}$ with $P_O(C_i) > 0$ for all $i \in \{1, \ldots, k\}$, there exists a distribution
$\Pr$ on $\mathcal{R}$ such that $P_O = \Pr_O$ (i.e. $P_O$ is the marginal of $\Pr$ on $\mathcal{O}$) and $\Pr$ satisfies the
generalized CAR condition (part (b) of Theorem 5.1) for $U_1, \ldots, U_n$.

Proof: Given a set $W$ of worlds, a set $\mathcal{O} = \{C_1, \ldots, C_k\}$ of observations with distribution
$P_O$ satisfying $P_O(C_i) > 0$ for $i \in \{1, \ldots, k\}$, and arbitrary distributions $\Pr_j$ on $U_j$,
$j = 1, \ldots, n$, we explicitly construct a prior $\Pr$ on $\mathcal{R}$ that satisfies CAR such that
$P_O = \Pr_O$, where $\Pr_O$ is the marginal of $\Pr$ on $\mathcal{O}$ and $\Pr_j = \Pr_W(\cdot \mid U_j)$.
Given $w \in U_j$, define

\[ \Pr(\{r \in \mathcal{R} : X_O(r) = C_i, X_W(r) = w\}) = P_O(C_i)\, \alpha_{ij} \Pr_j(w). \]

(How the probability is split up over all the runs $r$ such that $X_O(r) = C_i$ and $X_W(r) = w$
is irrelevant.) It remains to check that $\Pr$ is a distribution on $\mathcal{R}$ and that it satisfies all
the requirements. It is easy to check that

\[ \Pr(X_O = C_i) = \sum_{j=1}^{n} \sum_{w \in U_j} P_O(C_i)\, \alpha_{ij} \Pr_j(w) = P_O(C_i). \]

It follows that $\sum_{i=1}^{k} \Pr(X_O = C_i) = 1$, showing that $\Pr$ is a probability measure and $P_O$
is the marginal of $\Pr$ on $\mathcal{O}$. If $w \in U_j$, then

\[
\begin{aligned}
\Pr_W(w \mid U_j) &= \Pr_W(w) / \Pr_W(U_j) \\
&= \frac{\sum_{i=1}^{k} \Pr_O(C_i)\, \alpha_{ij} \Pr_j(w)}{\sum_{w' \in U_j} \sum_{i=1}^{k} \Pr_O(C_i)\, \alpha_{ij} \Pr_j(w')} \\
&= \frac{\Pr_j(w) \sum_{i=1}^{k} \Pr_O(C_i)\, \alpha_{ij}}{\bigl(\sum_{w' \in U_j} \Pr_j(w')\bigr) \sum_{i=1}^{k} \Pr_O(C_i)\, \alpha_{ij}} \\
&= \Pr_j(w).
\end{aligned}
\]

Finally, note that, for $j \in \{1, \ldots, n\}$, for all $w \in U_j$ such that $\Pr(X_W = w) > 0$, we have
that

\[
\begin{aligned}
\Pr(X_O = C_i \mid X_W = w)
&= \frac{\Pr_O(C_i)\, \alpha_{ij} \Pr_j(w)}{\sum_{i'=1}^{k} \Pr_O(C_{i'})\, \alpha_{i'j} \Pr_j(w)} \\
&= \frac{\Pr_O(C_i)\, \alpha_{ij}}{\sum_{i'=1}^{k} \Pr_O(C_{i'})\, \alpha_{i'j}} \\
&= \frac{\Pr_O(C_i)\, \alpha_{ij} \sum_{w' \in U_j} \Pr_j(w')}{\sum_{i'=1}^{k} \Pr_O(C_{i'})\, \alpha_{i'j} \sum_{w' \in U_j} \Pr_j(w')} \\
&= \frac{\Pr(X_O = C_i \cap X_W \in U_j)}{\Pr(X_W \in U_j)} \\
&= \Pr(X_O = C_i \mid X_W \in U_j),
\end{aligned}
\]

so the generalized CAR condition holds for $\{U_1, \ldots, U_n\}$. ∎
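
The construction in this proof is concrete enough to run. The sketch below builds the joint distribution $\Pr(X_O = C_i, X_W = w) = P_O(C_i)\,\alpha_{ij}\Pr_j(w)$ for a small, made-up instance and verifies both the marginal on observations and the generalized CAR condition with exact rational arithmetic. All names and numbers are illustrative assumptions.

```python
from fractions import Fraction as F

def build_joint(p_obs, alphas, cell_dists):
    """Pr(X_O = C_i, X_W = w) = P_O(C_i) * alpha_ij * Pr_j(w) for w in U_j."""
    return {(i, w): p_obs[i] * alphas[i][j] * pj_w
            for i in range(len(p_obs))
            for j, dist in enumerate(cell_dists)
            for w, pj_w in dist.items()}

def generalized_car_holds(joint, cell_dists, k):
    """Check Pr(X_O = C_i | X_W = w) = Pr(X_O = C_i | X_W in U_j) for w in U_j."""
    for dist in cell_dists:
        cell = list(dist)
        p_cell = sum(joint[(i2, w)] for i2 in range(k) for w in cell)
        for i in range(k):
            p_obs_given_cell = sum(joint[(i, w)] for w in cell) / p_cell
            for w in cell:
                p_w = sum(joint[(i2, w)] for i2 in range(k))
                if p_w > 0 and joint[(i, w)] / p_w != p_obs_given_cell:
                    return False
    return True

p_obs = [F(1, 4), F(3, 4)]                                # P_O(C_1), P_O(C_2)
alphas = [[F(1, 2), F(1, 2)], [F(1, 5), F(4, 5)]]         # alpha_ij, rows sum to 1
cell_dists = [{"a": F(1, 3), "b": F(2, 3)}, {"c": F(1)}]  # Pr_j on U_1 = {a, b}, U_2 = {c}
joint = build_joint(p_obs, alphas, cell_dists)
print(generalized_car_holds(joint, cell_dists, k=2))  # True
```

Because $\Pr_j(w)$ cancels in the ratio, the conditional probability of each observation depends on $w$ only through the cell $U_j$ containing it, which is exactly what the check confirms.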



To prove Theorem 5.3, we first need some background on minimum relative entropy
distributions. Fix some space $W$ and let $U_1, \ldots, U_n$ be subsets of $W$. Let $A$ be the set of
$(\alpha_1, \ldots, \alpha_n)$ for which there exists some distribution $P_W$ with $P_W(U_i) = \alpha_i$ for $i = 1, \ldots, n$
and $P_W(w) > 0$ for all $w \in W$. Now let $P_W$ be a distribution with $P_W(w) > 0$ for all
$w \in W$. Given a vector $\beta = (\beta_1, \ldots, \beta_n) \in \mathbb{R}^n$, let

\[ P_W^{\beta}(w) = \frac{1}{Z}\, e^{\beta_1 1_{U_1}(w) + \cdots + \beta_n 1_{U_n}(w)} P_W(w), \]

where $1_U$ is the indicator function, i.e. $1_U(w) = 1$ if $w \in U$ and 0 otherwise, and
$Z = \sum_{w \in W} e^{\beta_1 1_{U_1}(w) + \cdots + \beta_n 1_{U_n}(w)} P_W(w)$ is a normalization factor. Let $\alpha_i = P_W^{\beta}(U_i)$ for
$i = 1, \ldots, n$. By [Csiszár 1975, Theorems 2.1 and 3.1], it follows that

\[ P_W(\cdot \mid \alpha_1 U_1; \ldots; \alpha_n U_n) = P_W^{\beta}. \tag{11} \]

Moreover, for each vector $(\alpha_1, \ldots, \alpha_n) \in A$, there is a vector $\beta = (\beta_1, \ldots, \beta_n) \in \mathbb{R}^n$ such
that (11) holds. (For an informal and easy derivation of (11), see [Cover and Thomas
1991, Chapter 9].)
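
For two overlapping sets, a distribution of the exponential form in (11) can be approximated numerically by alternating the two single-constraint updates. This is standard iterative proportional fitting, swapped in here as an illustrative numerical method; it is not the procedure used in the paper, and the names, prior, and sweep count are assumptions. Every step multiplies the current distribution by a factor that is constant on $U_i$ and on its complement, so the result keeps the form of (11), which the last lines check through the identity (ratio on $V_3$)(ratio on $V_4$) = (ratio on $V_1$)(ratio on $V_2$).

```python
def jeffrey_on_set(dist, U, alpha):
    """Jeffrey conditioning on alpha U; (1 - alpha)(W - U)."""
    mass = sum(p for w, p in dist.items() if w in U)
    return {w: (alpha * p / mass if w in U else (1 - alpha) * p / (1 - mass))
            for w, p in dist.items()}

def mre_two_constraints(prior, U1, a1, U2, a2, sweeps=1000):
    """Alternate the two single-constraint updates until both hold (approximately)."""
    dist = dict(prior)
    for _ in range(sweeps):
        dist = jeffrey_on_set(dist, U1, a1)
        dist = jeffrey_on_set(dist, U2, a2)
    return dist

# Worlds 1..4 play the roles of V_1, V_2, V_3, V_4 (one world each).
prior = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}
U1, U2 = {1, 3}, {2, 3}
post = mre_two_constraints(prior, U1, 0.6, U2, 0.7)

# Density ratios are e^{b1}/Z, e^{b2}/Z, e^{b1+b2}/Z and 1/Z on V_1..V_4.
ratio = {w: post[w] / prior[w] for w in prior}
print(abs(ratio[3] * ratio[4] - ratio[1] * ratio[2]) < 1e-9)  # True
```

The vector $\beta$ of (11) can be read off from the ratios as $\beta_1 = \ln(\text{ratio on } V_1 / \text{ratio on } V_4)$ and $\beta_2 = \ln(\text{ratio on } V_2 / \text{ratio on } V_4)$.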

Lemma A.2: Let $C = \alpha_1 U_1; \ldots; \alpha_n U_n$ for some $(\alpha_1, \ldots, \alpha_n) \in A$. Let $(\beta_1, \ldots, \beta_n)$ be a
vector such that (11) holds for $\alpha_1, \ldots, \alpha_n$. If $\beta_i = 0$ for some $i \in \{1, \ldots, n\}$, then

\[ P_W(U_i \mid \alpha_1 U_1; \ldots; \alpha_{i-1} U_{i-1}; \alpha_{i+1} U_{i+1}; \ldots; \alpha_n U_n) = \alpha_i. \]

Proof: Without loss of generality, assume that $\beta_1 = 0$. Taking $\alpha'_i = P_W^{\beta}(U_i)$ for $i =
2, \ldots, n$, it follows from (11) that

\[ P_W(w \mid \alpha'_2 U_2; \ldots; \alpha'_n U_n) = \frac{1}{Z}\, e^{\beta_2 1_{U_2}(w) + \cdots + \beta_n 1_{U_n}(w)} P_W(w), \]

so that

\[ P_W(\cdot \mid \alpha_1 U_1; \ldots; \alpha_n U_n) = P_W(\cdot \mid \alpha'_2 U_2; \ldots; \alpha'_n U_n). \]

Since $P_W(U_i \mid \alpha'_2 U_2; \ldots; \alpha'_n U_n) = \alpha'_i$ and $P_W(U_i \mid \alpha_1 U_1; \ldots; \alpha_n U_n) = \alpha_i$ for $i = 2, \ldots, n$, we
have that $\alpha_i = \alpha'_i$ for $i = 2, \ldots, n$. Thus, $P_W(\cdot \mid \alpha_2 U_2; \ldots; \alpha_n U_n) = P_W(\cdot \mid \alpha_1 U_1; \ldots; \alpha_n U_n)$
and, in particular,

\[ \alpha_1 = P_W(U_1 \mid \alpha_1 U_1; \ldots; \alpha_n U_n) = P_W(U_1 \mid \alpha_2 U_2; \ldots; \alpha_n U_n). \]

∎

Theorem 5.3: Given a set $\mathcal{R}$ of runs and a set $\mathcal{O} = \{C_1, C_2\}$ of observations, where
$C_i = \alpha_{i1} U_1; \alpha_{i2} U_2$, for $i = 1, 2$, let $\Pr$ be a distribution on $\mathcal{R}$ such that $\Pr(X_O = C_1)$,
$\Pr(X_O = C_2) > 0$, and $\Pr_W(w) = \Pr(X_W = w) > 0$ for all $w \in W$. Let $\Pr^i = \Pr(\cdot \mid X_O =
C_i)$, and let $\Pr^i_W$ be the marginal of $\Pr^i$ on $W$. If either $C_1$ or $C_2$ is not Jeffrey-like, then
we cannot have $\Pr^i_W = \Pr_W(\cdot \mid C_i)$, for both $i = 1, 2$.

Proof: Let $V_1 = U_1 - U_2$, $V_2 = U_2 - U_1$, $V_3 = U_1 \cap U_2$, and $V_4 = W - (U_1 \cup U_2)$. Since
$V_1, V_2, V_3, V_4$ are all assumed to be nonempty, we have $A = (0, 1)^2$, where $A$ is defined
as above, that is, $A$ is the set of $(\alpha_1, \alpha_2)$ such that there exists a distribution $P_W$ with
$P_W(U_1) = \alpha_1$, $P_W(U_2) = \alpha_2$, $P_W(w) > 0$ for all $w \in W$. If $\Pr^i_W = \Pr_W(\cdot \mid C_i)$ for $i = 1, 2$,
then

\[ \lambda \Pr_W(\cdot \mid C_1) + (1 - \lambda) \Pr_W(\cdot \mid C_2) = \Pr_W, \tag{12} \]

where $\lambda = \Pr(X_O = C_1)$. We prove the theorem by showing that (12) cannot hold if
either $C_1$ or $C_2$ is not Jeffrey-like. Since we have assumed that $(\alpha_{i1}, \alpha_{i2}) \in (0, 1)^2 = A$
for $i = 1, 2$, we can apply (11) to $C_i$ for $i = 1, 2$. Thus, there are vectors $(\beta_{i1}, \beta_{i2}) \in \mathbb{R}^2$
for $i = 1, 2$ such that, for all $w \in W$,

\[ \Pr_W(w \mid C_i) = \frac{1}{Z_i}\, e^{\beta_{i1} 1_{U_1}(w) + \beta_{i2} 1_{U_2}(w)} \Pr_W(w). \tag{13} \]

(13) implies that $\Pr_W(V_1 \mid C_i) = Z_i^{-1} e^{\beta_{i1}} \Pr_W(V_1)$, $\Pr_W(V_2 \mid C_i) = Z_i^{-1} e^{\beta_{i2}} \Pr_W(V_2)$,
$\Pr_W(V_3 \mid C_i) = Z_i^{-1} e^{\beta_{i1} + \beta_{i2}} \Pr_W(V_3)$, and $\Pr_W(V_4 \mid C_i) = Z_i^{-1} \Pr_W(V_4)$. Plugging this into
(12), we obtain the following four equations:

\[
\begin{aligned}
\Pr_W(V_1) &= \lambda \frac{e^{\beta_{11}}}{Z_1} \Pr_W(V_1) + (1 - \lambda) \frac{e^{\beta_{21}}}{Z_2} \Pr_W(V_1) \\
\Pr_W(V_2) &= \lambda \frac{e^{\beta_{12}}}{Z_1} \Pr_W(V_2) + (1 - \lambda) \frac{e^{\beta_{22}}}{Z_2} \Pr_W(V_2) \\
\Pr_W(V_3) &= \lambda \frac{e^{\beta_{11} + \beta_{12}}}{Z_1} \Pr_W(V_3) + (1 - \lambda) \frac{e^{\beta_{21} + \beta_{22}}}{Z_2} \Pr_W(V_3) \\
\Pr_W(V_4) &= \lambda \frac{1}{Z_1} \Pr_W(V_4) + (1 - \lambda) \frac{1}{Z_2} \Pr_W(V_4).
\end{aligned}
\tag{14}
\]

Since we have assumed that $\Pr_W(w) > 0$ for all $w \in W$, it must be the case that $\Pr_W(V_i) >
0$, for $i = 1, \ldots, 4$. Thus, $\Pr_W(V_i)$ factors out of the $i$th equation above. By the change of
variables $\mu = \lambda / Z_1$, $1 - \mu = (1 - \lambda) / Z_2$, $\epsilon_{ij} = e^{\beta_{ij}} - 1$ and some rewriting, we see that
(14) is equivalent to

\[
\begin{aligned}
0 &= \mu \epsilon_{11} + (1 - \mu) \epsilon_{21} \\
0 &= \mu \epsilon_{12} + (1 - \mu) \epsilon_{22} \\
0 &= \mu (\epsilon_{11} + \epsilon_{12} + \epsilon_{11} \epsilon_{12}) + (1 - \mu)(\epsilon_{21} + \epsilon_{22} + \epsilon_{21} \epsilon_{22}).
\end{aligned}
\tag{15}
\]

If, for some $i$, both $\epsilon_{i1}$ and $\epsilon_{i2}$ are nonzero, then the three equations of (15) have no
solutions for $\mu \in (0, 1)$. Equivalently, if for some $i$, both $\beta_{i1}$ and $\beta_{i2}$ are nonzero, then
the four equations of (14) have no solutions for $\lambda \in (0, 1)$. So it only remains to show
that for some $i$, both $\beta_{i1}$ and $\beta_{i2}$ are nonzero. To see this, note that by assumption for
some $i$, $C_i$ is not Jeffrey-like. But then it follows from Lemma A.2 above that both $\beta_{i1}$
and $\beta_{i2}$ are nonzero. Thus, the theorem is proved. ∎
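
The claim that (15) has no solution when both quantities for some observation are nonzero can be checked by a short elimination. The derivation below is a sketch added for convenience; it uses only the three equations of (15) and the fact that $\mu \in (0, 1)$, with $\epsilon_{ij} = e^{\beta_{ij}} - 1$ as in the change of variables above.

```latex
\begin{align*}
% third equation of (15) minus the first two:
0 &= \mu\,\epsilon_{11}\epsilon_{12} + (1-\mu)\,\epsilon_{21}\epsilon_{22}, \\
% first two equations of (15), solved for the second observation:
\epsilon_{21} &= -\frac{\mu}{1-\mu}\,\epsilon_{11}, \qquad
\epsilon_{22} = -\frac{\mu}{1-\mu}\,\epsilon_{12}, \\
% substitute the second line into the first:
0 &= \mu\,\epsilon_{11}\epsilon_{12} + \frac{\mu^{2}}{1-\mu}\,\epsilon_{11}\epsilon_{12}
   = \frac{\mu}{1-\mu}\,\epsilon_{11}\epsilon_{12}.
\end{align*}
```

Since $\mu/(1-\mu) > 0$, any solution must have $\epsilon_{11}\epsilon_{12} = 0$, and then $\epsilon_{21}\epsilon_{22} = (\mu/(1-\mu))^2\,\epsilon_{11}\epsilon_{12} = 0$ as well. So a solution exists only if, for each $i$, at least one of $\epsilon_{i1}, \epsilon_{i2}$ vanishes.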